1 R Setup and Required Packages

In this project, the open-source R programming language is used to model the progression of the COVID-19 pandemic in different U.S. counties. R is maintained by an international team of developers who make the language available at The Comprehensive R Archive Network. Readers interested in reusing our code and reproducing our results should have R installed locally on their machines. R can be installed on a number of different operating systems (see Windows, Mac, and Linux for the installation instructions for these systems). We also recommend using the RStudio interface for R. The reader can download RStudio for free by following the instructions at the link. For non-R users, we recommend Hands-On Programming with R for a brief overview of the software’s functionality. Hereafter, we assume that the reader has an introductory understanding of the R programming language.

In the code chunk below, we load the packages used to support our analysis. Note that the code in this and any other code chunk can be hidden by clicking on the ‘Hide’ button to facilitate navigation. The reader can hide all code and/or download the Rmd file associated with this document by clicking on the Code button in the top right corner of this document. Our input and output files can also be accessed/downloaded from fmegahed/covid19.

if(require(pacman)==FALSE) install.packages("pacman") # check to see if the pacman package is installed; if not install it
if(require(devtools)==FALSE) install.packages("devtools") # check to see if the devtools package is installed; if not install it

# to check and install if these packages are not found locally on machine
if(require(albersusa)==FALSE) devtools::install_github('hrbrmstr/albersusa') #install package if needed
if(require(r2d3maps)==FALSE) devtools::install_github('dreamRs/r2d3maps') #install package if needed


# check if packages are not installed; if yes, install missing packages
pacman::p_load(tidyverse, magrittr, janitor, dataPreparation, lubridate, skimr, # for data analysis
               COVID19, rvest, readxl, # for extracting relevant data
               DT, pander, stargazer, knitr, # for formatting and nicely printed outputs
               scales, RColorBrewer, DataExplorer, tiff, grid,# for plots
               plotly, albersusa, tigris, leaflet, tmap, # for maps
               zoo, fpp2, NbClust, # for TS analysis and clustering
               VIM, rgdal,  spdep,  nimble, fastDummies, matrixStats, # for spatial regression 
               nnet, caret, # for multinomial regression modeling
               conflicted) # for managing conflicts in functions with same names

# Handling conflicting function names from packages
conflict_prefer('combine', 'dplyr') # Preferring dplyr::combine over any other package
conflict_prefer('select', "dplyr") #Preferring dplyr::select over any other package
conflict_prefer("summarize", "dplyr") # similar to above but with dplyr::summarize
conflict_prefer("filter", "dplyr") # Preferring filter from dplyr
conflict_prefer("dist", "stats") # Preferring dist from stats
conflict_prefer("as.dist", "stats") # Preferring as.dist from stats

# Custom Functions
source_url('https://raw.githubusercontent.com/fmegahed/covid19-deaths/master/Markdown/custom_functions.R')

set.seed(2020) # to assist with reproducibility
sInfo = sessionInfo() # saving all the packages/functions and session info

2 Extracting the Datasets

For our analysis, we fuse data from multiple sources. We describe the process of obtaining and merging each of these sources in the subsections below.

2.1 Time Series Data

In this section, we utilize the COVID19 package (Guidotti & Ardia, 2020) to obtain the following information:

  • Confirmed cases, recoveries and deaths;
  • Policy information (e.g., transport closing, school closing, closing event, movement restrictions, testing policies, and contact tracing); and
  • Population and standard geographic information for each county.

From this information, we have also computed the new daily and weekly confirmed cases/deaths per county. The data are stored in a tidy format, but can be expanded to a wide format using pivot_wider() from the tidyr package (part of the tidyverse).
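As a minimal sketch of the two operations described above (differencing cumulative counts into new daily counts, and expanding tidy data to a wide format), consider the toy example below. The data frame `toy`, its two county ids, and all counts are hypothetical and are not taken from the COVID19 package.

```r
library(dplyr)
library(tidyr)

# Hypothetical cumulative confirmed cases for two made-up counties
toy = tibble(
  id = rep(c("c1", "c2"), each = 4),
  date = rep(as.Date("2020-03-01") + 0:3, times = 2),
  confirmed = c(0, 2, 5, 9, 1, 1, 4, 4)
)

# New daily cases = first difference of the cumulative series within each county
toyDaily = toy %>%
  group_by(id) %>%
  arrange(id, date) %>%
  mutate(newCases = c(NA, diff(confirmed))) %>% # first day has no previous value
  ungroup()

# Tidy (one row per county-date) to wide (one row per county, one column per date)
toyWide = toyDaily %>%
  select(id, date, newCases) %>%
  pivot_wider(names_from = date, values_from = newCases)

print(toyWide)
```

The wide layout (one row per county) is the shape that the clustering step later in this document expects.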

endDate = '2021-01-02'
endDatePrintV = format(ymd(endDate), format = "%b %d, %Y")

counties = covid19(country = "US", 
                   level = 3, # for county
                   start = "2020-03-01", # First Sunday in March
                   end = endDate, # end Date 
                   raw = FALSE, # to ensure that all counties have the same grid of dates
                   amr = NULL, # we are not using the apple mobility data for our analysis
                   gmr = NULL, # we are not using the Google mobility data for our analysis
                   wb = NULL, # world bank data not helpful for county level analysis
                   verbose = FALSE)

counties %<>% # next line removes non-contiguous US states/territories
  filter(!administrative_area_level_2 %in% c('Alaska', 'Hawaii', 'Puerto Rico', 'Northern Mariana Islands', 'Virgin Islands')) %>% 
  fast_filter_variables(verbose = FALSE) %>% #dropping invariant columns or bijections
  filter(!is.na(key_numeric)) %>%  # these are not counties
  group_by(id) %>% # grouping the data by the id column to make computations correct
  arrange(id, date) %>% # to ensure correct calculations
  mutate(day = wday(date, label = TRUE) %>% factor(ordered = F), # day of week
         newCases = c(NA, diff(confirmed)), # computing new daily cases per county
         newDeaths = c(NA, diff(deaths)) )  # computing new daily deaths per county

# manually identifying factor variables
factorVars = c("school_closing", "workplace_closing", "cancel_events",
               "gatherings_restrictions", "transport_closing", "stay_home_restrictions",
               "internal_movement_restrictions", "international_movement_restrictions",
               "information_campaigns", "testing_policy", "contact_tracing")

counties %<>% # converting those variables into character and then factor
  mutate_at(.vars = vars(any_of(factorVars)), .funs = as.character) %>% 
  mutate_at(.vars = vars(any_of(factorVars)), .funs = as.factor)

# Saving the data into an RDS file
saveRDS(counties, paste0("../Data/counties.rds"))

2.2 Cross Sectional Data

In the code chunk below, we obtain seven additional datasets, whose variables can explain the differences between the time-series of the number of COVID cases per county:

A. Rural/Underserved Counties: From the Consumer Financial Protection Bureau, we have obtained the Final 2020 List titled: Rural or underserved counties. Per the website, the procedure for determining the classification of a county is as follows: “Beginning in 2020, the rural or underserved counties lists use a methodology for identifying underserved counties described in the Bureau’s interpretive rule: Truth in Lending Act (Regulation Z); Determining ‘Underserved’ Areas Using Home Mortgage Disclosure Act Data.”

B. Based on U.S. Census data, we extracted the land area in square miles for each county and combined it with population to compute each county’s population density. Based on the available COVID-19 literature, we hypothesize that population density is predictive of hotspots for COVID transmission.

C. Based on the MIT Election Data and Science Lab (2018), we have obtained the voting results for all counties in the 2016 Presidential elections. The data was used to compute the percentage of total votes that went to President Trump, with the underlying hypothesis that the politicization of COVID response (e.g., perception/willingness to use face masks, policies and the population’s reaction to the disease) may be explained by party affiliation.

D. We extracted an overall government response index capturing the strength of COVID-19 response policies on a state (and the District of Columbia) level from the Blavatnik School of Government’s GitHub Repository. This index captures 13 different indicators, capturing the “full range of government response”. Details for how this indicator is computed can be found at BSG-WP-2020/034.

E. Based on the following Kaiser Health News Webpage, we extracted county-level information on the percentage of the population aged 60+ and the number of ICU beds per 10,000 seniors.

F. We have engineered a region variable based on the CDC’s 10 Regions Framework. While geographic regions are hypothesized to be a factor in disease outbreaks, we chose to utilize the CDC regions specifically based on the following explanation from the aforementioned link:
> “CDC’s National Center for Chronic Disease Prevention and Health Promotion (NCCDPHP) is strengthening the consistency and quality of the guidance, communications, and technical assistance provided to states to improve coordination across our state programs”

G. Based on the Census’s Small Area Income and Poverty Estimates (SAIPE) Program, we extracted the estimate for the percent of population in poverty. The estimate is based on 2018 data (released in December 2019). At the time of the start of our analysis, these estimates were the most up to date publicly available data.
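The sources listed above are all attached to the county list in the same way: a left merge on a county key, followed by a derived variable. The sketch below illustrates that pattern for the county type and population density using base R only. The data frames `countiesToy`, `areasToy` and `ruralToy`, their FIPS-like keys, and all values are hypothetical.

```r
# Hypothetical county records (keys and values are made up for illustration)
countiesToy = data.frame(key_numeric = c(1001, 1003, 1005),
                         population = c(50000, 200000, 25000))
areasToy = data.frame(key_numeric = c(1001, 1003),
                      landAreaSqMiles = c(500, 1600))
ruralToy = data.frame(key_numeric = 1005,
                      countyType = "Rural/Underserved",
                      stringsAsFactors = FALSE)

# all.x = TRUE keeps every county, even when a source has no matching row
merged = merge(countiesToy, areasToy, by = "key_numeric", all.x = TRUE)
merged = merge(merged, ruralToy, by = "key_numeric", all.x = TRUE)

# Counties absent from the rural/underserved list are labeled "Other"
merged$countyType[is.na(merged$countyType)] = "Other"

# Population density = population / land area (NA when the area is missing)
merged$popDensity = merged$population / merged$landAreaSqMiles

print(merged)
```

A left merge is the safer choice here because a missing match shows up as an NA that can be inspected, rather than silently dropping the county.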

crossSectionalData = counties %>% ungroup() %>% 
  select(id, key_numeric, key_google_mobility, population,
         administrative_area_level_2, administrative_area_level_3) %>%
  unique()

# [A] Rural/Underserved Classification of the County
ru = read.csv("https://www.consumerfinance.gov/documents/8911/cfpb_rural-underserved-list_2020.csv")
ru %<>%  transmute(key_numeric = FIPS.Code, #renaming FIPS.Code to key_numeric 
                countyType = "Rural/Underserved") # creates two vars and drop old vars
crossSectionalData = merge(crossSectionalData, ru, by = "key_numeric", all.x = TRUE) # to define NA counties
crossSectionalData$countyType %<>% replace_na("Other") # for any county not in the Consumer FIN data replace NA by Other


# [B] Population Density of Each County
download.file("https://www2.census.gov/library/publications/2011/compendia/usa-counties/excel/LND01.xls",
              destfile = "../Data/LND01.xls", mode = "wb") # downloading Land Area Data Per the 2010 Census
areas = read_excel("../Data/LND01.xls") %>% # reading the Excel file
  select(STCOU, LND110210D) #selecting only the FIPS and the Land Area from the 2010 Census variables
colnames(areas) = c("key_numeric", "LandAreaSqMiles2010") # Renaming the columns
areas$key_numeric %<>% as.numeric() # to remove leading 0 

crossSectionalData = merge(crossSectionalData, areas, by ="key_numeric", all.x = TRUE) # adding the area to data frame
crossSectionalData$popDensity = crossSectionalData$population / crossSectionalData$LandAreaSqMiles2010 # creating the population density variable
crossSectionalData %<>% select(-c(population, LandAreaSqMiles2010)) #dropping two variables used in creating pop density 


# [C] 2016 Presidential Election County Data (MIT Election Data and Science Lab, via Harvard Dataverse) https://doi.org/10.7910/DVN/VOQCHQ
elections = read.csv("../Data/countypres_2000-2016.csv") %>% # reading the downloaded CSV
  filter(year == 2016 & party == "republican") %>% # just keeping data for recent election and republican votes
  mutate(key_numeric = FIPS, # renaming FIPS to key_numeric
         percRepVotes = 100*(candidatevotes/totalvotes) ) %>% # computing percent of republican votes (from total votes)
  select(key_numeric, percRepVotes) # keeping only the key and variable used in merge
crossSectionalData %<>%  merge(elections, by = "key_numeric", all.x = TRUE) # merge with the counties data


# [D] Policy Data
policy = read_csv('https://raw.githubusercontent.com/OxCGRT/USA-covid-policy/master/data/OxCGRT_US_latest.csv')
policy = filter(policy, !is.na(RegionName) & !RegionName %in% c('Alaska', 'Hawaii')) # both conditions must hold; with | no rows were dropped
policy$state = toupper(policy$RegionName) # a state variable = an upper case of existing RegionName
policy$Date %<>% ymd() # converting the Date data to a date format

policySummary = policy %>% # calculating a summary table of median value for the GovernmentResponseIndex per state
  filter(Date >= '2020-03-01' & Date <= endDate) %>% # to match our COVID Data timeSeries
  group_by(state) %>% # perform computations using the median value, per state, for each index
  summarise(GovernmentResponseIndexMedian = median(GovernmentResponseIndex, na.rm = TRUE))
policySummary$state %<>%  str_to_title() %>% str_replace('Washington Dc', 'District of Columbia') # title-case first so that "of" stays lower case for the merge

crossSectionalData %<>%  merge(policySummary, by.x = "administrative_area_level_2", by.y = 'state', all.x = TRUE) 


# [E] Kaiser Health News Data on the County Level
hospitals = read.csv("../Data/data-FPBfZ.csv") %>% # downloaded from KHN on 2020-10-26 (~9:30 pm EDT)
  transmute(State = State, # keeping the State Variable | transmute drops variables that are not in call
            County = County, # keeping the County Variable
            PercentSeniors = Percent.of.Population.Aged.60., # Shortening Original Variable Name
            icuBedsPer10000Seniors = 10000 * ICU.Beds/Population.Aged.60.) # Computing icuBedsPer10000Seniors

crossSectionalData %<>% merge(hospitals, 
                              by.x = c("administrative_area_level_2", "administrative_area_level_3"),
                              by.y = c("State", "County"), all.x = TRUE)


# [F] CDC Regions for Each State
regionsCDC = data.frame(States = c('Connecticut', 'Maine', 'Massachusetts', 'New Hampshire', 'Rhode Island' , 
                                   'Vermont', 'New York', # End of Region A
                                   'Delaware', 'District of Columbia', 'Maryland', 'Pennsylvania',
                                   'Virginia', 'West Virginia', 'New Jersey', # End of Region B
                                   'North Carolina', 'South Carolina', 'Georgia', 'Florida', # Region C
                                   'Kentucky', 'Tennessee', 'Alabama', 'Mississippi', # Region D
                                   'Illinois', 'Indiana', 'Michigan', 'Minnesota', 'Ohio',
                                   'Wisconsin', # End of Region E
                                   'Arkansas', 'Louisiana', 'New Mexico', 'Oklahoma', 'Texas', # Region F
                                   'Iowa', 'Kansas', 'Missouri', 'Nebraska', # Region G
                                   'Colorado', 'Montana', 'North Dakota', 'South Dakota',
                                   'Utah', 'Wyoming', # End of Region H
                                   'Arizona', 'California', 'Hawaii', 'Nevada', # Region I
                                   'Alaska', 'Idaho', 'Oregon', 'Washington' # Region J
                                   ),
                        regions = c(rep('A', 7), rep('B', 7), rep('C', 4),
                                    rep('D', 4), rep('E', 6), rep('F', 5),
                                    rep('G', 4), rep('H', 6), rep('I', 4),
                                    rep('J', 4) ) )

crossSectionalData %<>% merge(regionsCDC, by.x = 'administrative_area_level_2', by.y = 'States', all.x = TRUE) # merge


# [G] Poverty Estimates
download.file("https://www2.census.gov/programs-surveys/saipe/datasets/2018/2018-state-and-county/est18all.xls", 
              destfile = "../Data/est18all.xls", mode = "wb") # downloading the data for poverty estimates (latest 2018)

poverty = read_excel("../Data/est18all.xls", skip = 3) %>% # reading the data in R
  transmute(key_numeric = paste0(`State FIPS Code`, `County FIPS Code`) %>% as.numeric, # creating the key from two variables
            povertyPercent = as.numeric(`Poverty Percent, All Ages`) ) # shortening povertyPercent Variable's Name
crossSectionalData %<>% merge(poverty, by = "key_numeric", all.x = TRUE) # merge


# Final Transformations before Saving the Counties Data
crossSectionalData %<>%  mutate_at(.vars = c('countyType', 'regions'), as.factor)  # converting the two vars to factor

# Saving the data into an RDS file
saveRDS(crossSectionalData, paste0("../Data/crossSectionalData.rds"))

# Tabulating the results and providing a way to export the table to different formats
datatable(crossSectionalData %>% select(-c(id, key_numeric, administrative_area_level_2, administrative_area_level_3)),
          extensions = c('FixedColumns', 'Buttons'), options = list(
            dom = 'Bfrtip',
            scrollX = TRUE,
            buttons = c('copy', 'csv', 'excel', 'pdf'),
            fixedColumns = list(leftColumns = 1)),
          rownames = FALSE) %>% 
  formatRound(columns= c('popDensity', 'percRepVotes', 'GovernmentResponseIndexMedian',
                         'PercentSeniors', 'icuBedsPer10000Seniors', 'povertyPercent'),
              digits=1)

2.3 Exploratory Analysis

In this section, we perform an exploratory analysis on the data obtained from the multiple sources.

2.3.1 Cumulative Cases

noGoogleNAs = filter(crossSectionalData, !is.na(key_google_mobility)) # removing NAs from key_google_mobility
idIndex = sample(noGoogleNAs$id, 9) # sampling 9 counties by id

# Saving cumulative cases figure to a tiff file
tiff(filename = '../Figures/sampleCumulativeCases.tiff',
    width = 1366, height =768, pointsize = 16)
counties %>% filter(id %in% idIndex) %>% 
  ggplot(aes(x = date, y = confirmed, group = id, color = key_google_mobility)) +
  geom_line(size = 1.25) +
  scale_x_date(date_breaks = "1 month", date_labels = "%b") +
  facet_wrap(~ key_google_mobility, scales = 'free_y', ncol = 3) +
  theme(legend.position = 'none') + 
  labs(color = '', x = 'Month', y = 'Cumulative Cases By County',
       caption = paste0('Based on Data from March 01, 2020 - ', endDatePrintV)) +
  scale_color_brewer(type = 'qual', palette = 'Paired')
invisible( dev.off() ) # to suppress the unwanted output from dev.off

# Creating an interactive plot for the markdown
p = ggplot2::last_plot() + geom_line(size = 0.75) + # modifying the plot for plotly
  theme_bw(base_size = 9) + theme(legend.position = 'none') # to make margins smaller
ggplotly(p, height = 768) %>%  layout_ggplotly()

2.3.2 New Daily Cases

# Saving new daily cases figure to a tiff file
tiff(filename = '../Figures/sampleNewDailyCases.tiff',
    width = 1366, height =768, pointsize = 16)
counties %>% filter(id %in% idIndex) %>% 
  ggplot(aes(x = date, y = newCases, group = id, color = key_google_mobility)) +
  geom_line(size = 1.25) +
  scale_x_date(date_breaks = "1 month", date_labels = "%b") +
  facet_wrap(~ key_google_mobility, scales = 'free_y', ncol = 3) +
  theme(legend.position = 'none') + 
  labs(color = '', x = 'Month', y = 'New Daily Cases By County',
       caption = paste0('Based on Data from March 01, 2020 - ', endDatePrintV)) +
  scale_color_brewer(type = 'qual', palette = 'Paired')
invisible( dev.off() ) # to suppress the unwanted output from dev.off

# Creating an interactive plot for the markdown
p = ggplot2::last_plot() + geom_line(size = 0.75) + # modifying the plot for plotly
  theme_bw(base_size = 9) + theme(legend.position = 'none') # to make margins smaller
ggplotly(p, height = 768) %>%  layout_ggplotly()

2.3.3 County Types

crossSectionalData$fips = str_pad(crossSectionalData$key_numeric,
                                     width = 5, side = 'left', pad = '0')
# Retrieving the U.S. county composite map as a simple features object
cty_sf = counties_sf("aeqd") %>% filter(!state %in% c('Alaska', 'Hawaii')) # from albersusa
cty_sf %<>% geo_join(crossSectionalData, by_sp= 'fips', by_df= 'fips')

# Saving a higher quality tiff file for use in the paper
tiff(filename = '../Figures/countyTypes.tiff', width = 1366, height =768, pointsize = 16)
tm_shape(cty_sf) + tm_polygons('countyType', title = 'County Type', palette = "Paired")
invisible( dev.off() ) # to suppress the unwanted output from dev.off

# Printing the png output in the Markdown doc
tm_shape(cty_sf) + tm_polygons('countyType', title = 'County Type', palette = "Paired")

2.3.4 Population Density

# Saving a higher quality tiff file for use in the paper
tiff(filename = '../Figures/popDensity.tiff', width = 1366, height =768, pointsize = 16)
tm_shape(cty_sf) + tm_polygons('popDensity', title = 'Population Density', palette = "Greens",
                               style = 'quantile')
invisible( dev.off() ) # to suppress the unwanted output from dev.off

# Printing the png output in the Markdown doc
tm_shape(cty_sf) + tm_polygons('popDensity', title = 'Population Density', palette = "Greens",
                               style = 'quantile')

2.3.5 Percent Republican Votes

# Saving a higher quality tiff file for use in the paper
tiff(filename = '../Figures/repVotes.tiff', width = 1366, height =768, pointsize = 16)
tm_shape(cty_sf) + tm_polygons('percRepVotes', title = '% Republican Votes', palette = "Reds")
invisible( dev.off() ) # to suppress the unwanted output from dev.off

# Printing the png output in the Markdown doc
tm_shape(cty_sf) + tm_polygons('percRepVotes', title = '% Republican Votes', palette = "Reds")

2.3.6 Government Response Data

state_sf = usa_sf("aeqd") %>% filter(!name %in% c('Alaska', 'Hawaii')) # from albersusa
state_sf %<>% geo_join(crossSectionalData, by_sp= 'name', by_df= 'administrative_area_level_2')

# Saving a higher quality tiff file for use in the paper
tiff(filename = '../Figures/govResponse.tiff', width = 1366, height =768, pointsize = 16)
tm_shape(state_sf) + tm_polygons('GovernmentResponseIndexMedian', 
                                 title = 'Median Value of the Government Response Index', palette = "-Greens")
invisible( dev.off() ) # to suppress the unwanted output from dev.off

# Printing the png output in the Markdown doc
tm_shape(state_sf) + tm_polygons('GovernmentResponseIndexMedian', 
                                 title = 'Median Value of the Government Response Index', palette = "-Greens")

2.3.7 Percent Seniors

# Saving a higher quality tiff file for use in the paper
tiff(filename = '../Figures/percSeniors.tiff', width = 1366, height =768, pointsize = 16)
tm_shape(cty_sf) + tm_polygons('PercentSeniors', title = '% Seniors', palette = "Greens")
invisible( dev.off() ) # to suppress the unwanted output from dev.off

# Printing the png output in the Markdown doc
tm_shape(cty_sf) + tm_polygons('PercentSeniors', title = '% Seniors', palette = "Greens")

2.3.8 CDC Regions

# Saving a higher quality tiff file for use in the paper
tiff(filename = '../Figures/cdcRegions.tiff', width = 1366, height =768, pointsize = 16)
tm_shape(state_sf) + tm_polygons('regions', title = 'CDC Region', palette = "Paired")
invisible( dev.off() ) # to suppress the unwanted output from dev.off

# Printing the png output in the Markdown doc
tm_shape(state_sf) + tm_polygons('regions', title = 'CDC Region', palette = "Paired")

2.3.9 Percent Poverty

# Saving a higher quality tiff file for use in the paper
tiff(filename = '../Figures/povertyPercent.tiff', width = 1366, height =768, pointsize = 16)
tm_shape(cty_sf) + tm_polygons('povertyPercent', title = 'Poverty Percent', palette = "Greens")
invisible( dev.off() ) # to suppress the unwanted output from dev.off

# Printing the png output in the Markdown doc
tm_shape(cty_sf) + tm_polygons('povertyPercent', title = 'Poverty Percent', palette = "Greens")

3 Time-Series Clustering

It is important to note that, in our estimation, there are three important decisions to be made when performing time-series clustering:

  • Preparation of the Different Time-Series to be Clustered: In this section, we have (a) selected the new daily cases per county as the primary variable of interest, (b) smoothed that variable using a seven-day moving average, and (c) scaled each county’s 7-day MA of new daily cases such that it is bounded between 0 and 1. This allows us to compare the shape of the time-series/profile across counties with different populations and very different case magnitudes.

  • Choice of Distance Measure: The Euclidean distance, i.e., the \(l_2\) norm, is the most commonly used distance measure since it is computationally efficient. However, it may not be suitable for applications where the time-series are of different lengths, and it is sensitive to noise, scale and time shifts (Sardá-Espinosa, 2017).

  • Choice of Clustering Algorithm: A large number of clustering algorithms have been proposed in the literature. The most common clustering approaches are shape-based, which include \(k\)-means clustering and hierarchical clustering. The reader is referred to Aghabozorgi et al. (2015) for a detailed review. In our preliminary analysis, we chose the hierarchical clustering approach since it provides an easy-to-understand dendrogram and the number of counties was small. However, in our full analysis, we use the \(k\)-means clustering algorithm since it is computationally efficient. Furthermore, we overcame the traditional limitation of having to pre-specify \(k\) by utilizing 26 indexes for determining the optimal number of clusters in a data set, based on the approach and package implementation of Charrad et al. (2014).
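To illustrate why the scaling step matters when the Euclidean distance is used, the sketch below compares two hypothetical series that share the same shape but differ in magnitude by a factor of 100. The vectors `small` and `large` and the helper `scaleToUnit()` are illustrative names, not objects from our analysis; the helper mirrors the scaling expression used in the data preparation chunk.

```r
# Two hypothetical smoothed series: same shape, very different magnitudes
small = c(1, 2, 4, 3, 1)
large = 100 * small

# Divide by the series maximum and floor at 0 (same form as the scaling in Section 3.1)
scaleToUnit = function(x) pmax(0, x / max(x, na.rm = TRUE), na.rm = TRUE)

# Euclidean (l2) distance before and after scaling
rawDist = sqrt(sum((small - large)^2))
scaledDist = sqrt(sum((scaleToUnit(small) - scaleToUnit(large))^2))

print(c(raw = rawDist, scaled = scaledDist))
```

Before scaling, the distance is dominated by the difference in magnitude; after scaling, the two profiles coincide, so the clustering can group counties by the shape of their outbreak rather than by its size.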

3.1 Data Preparation

clusteringPrep = counties %>% # from the counties
  select(id, date, key_google_mobility, newCases) %>% # selecting minimal amount of cols for visual inspection
  arrange(id, date) %>% # arranged to ensure correct calculations
  mutate(newMA7 = rollmeanr(newCases, k = 7, fill = NA), # 7-day ma of new (adjusted) cases
         maxMA7 = max(newMA7, na.rm = T), # obtaining the max per county to scale data
         scaledNewMA7 = pmax(0, newMA7/maxMA7, na.rm = TRUE) ) %>% # scaling data to a 0-1 scale by county
  select(id, key_google_mobility, date, scaledNewMA7) %>% # dropping the variable newCases
  pivot_wider(names_from = date, values_from = scaledNewMA7) # converting the data to a wide format for clustering

constantColumns  = which_are_constant(clusteringPrep, verbose = F) # identifying constant columns
datesDropped = colnames(clusteringPrep)[constantColumns] # used for printing the names after the code chunk

clusteringPrep %<>% select(-all_of(constantColumns) ) %>%  # speeds up clustering by dec length of series
  as.data.frame() # data needs to be data frame for clustering
row.names(clusteringPrep) = clusteringPrep[,1] # needed for tsclust
clusteringPrep = clusteringPrep[,-1] # dropping the id column since it is now row.name

The following dates were removed from our data frame since the scaledNewMA7 variable was constant across all counties: 2020-03-01, 2020-03-02, 2020-03-03, 2020-03-04, 2020-03-05, 2020-03-06 and 2020-03-07.
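One plausible explanation for why exactly seven dates are constant is sketched below on a single hypothetical series. Here `stats::filter()` is used as a base-R stand-in for a right-aligned 7-day mean; this is an assumption made to keep the example dependency-free and is not the `rollmeanr()` call used in the chunk above. The cumulative counts are made up.

```r
# Hypothetical cumulative counts for one county over ten days
confirmed = c(0, 1, 3, 6, 10, 15, 21, 28, 36, 45)

# The first difference is undefined on day 1, so newCases starts with NA
newCases = c(NA, diff(confirmed))

# Trailing 7-day mean (base-R stand-in): NA until a full window without NA exists
ma7 = as.numeric(stats::filter(newCases, rep(1 / 7, 7), sides = 1))

# Scaling as in the chunk above: NA values become 0 because of pmax(0, ..., na.rm = TRUE)
scaled = pmax(0, ma7 / max(ma7, na.rm = TRUE), na.rm = TRUE)

print(round(scaled, 3))
```

Days 1 to 6 have no complete 7-day window, and the window ending on day 7 still contains the undefined first difference, so the first seven scaled values are 0 for every county under these assumptions.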

3.2 Clustering Contiguous U.S. Counties

clusteringPrep %<>% select(-c(key_google_mobility))  # removing this variable so we can cluster

nc  = NbClust(clusteringPrep, distance = "euclidean", # euclidean distance
             min.nc = 2, max.nc = 49, # searching for optimal k between k=2 and k=49
             method = "kmeans", # using the k-means method
             index = "all") # using 26 of the 30 indices in the package
## *** : The Hubert index is a graphical method of determining the number of clusters.
##                 In the plot of Hubert index, we seek a significant knee that corresponds to a 
##                 significant increase of the value of the measure i.e the significant peak in Hubert
##                 index second differences plot. 
## 
## *** : The D index is a graphical method of determining the number of clusters. 
##                 In the plot of D index, we seek a significant knee (the significant peak in Dindex
##                 second differences plot) that corresponds to a significant increase of the value of
##                 the measure. 
##  
## ******************************************************************* 
## * Among all indices:                                                
## * 6 proposed 2 as the best number of clusters 
## * 3 proposed 3 as the best number of clusters 
## * 7 proposed 4 as the best number of clusters 
## * 1 proposed 6 as the best number of clusters 
## * 1 proposed 11 as the best number of clusters 
## * 2 proposed 13 as the best number of clusters 
## * 1 proposed 27 as the best number of clusters 
## * 1 proposed 44 as the best number of clusters 
## * 1 proposed 49 as the best number of clusters 
## 
##                    ***** Conclusion *****                            
##  
## * According to the majority rule, the best number of clusters is  4 
##  
##  
## *******************************************************************
kclus  = nc$Best.partition %>% as.data.frame() %>% #obtaining the best partition/ cluster assignment for optimal k
  rename(., cluster_group = .) %>% rownames_to_column("County") 

#converting the wide to tall data and adding the cluster groupings
clusters  = clusteringPrep %>% 
  rownames_to_column(var = "County") %>% 
  pivot_longer(cols = -County, names_to = "Date") %>% # all date columns, including the 2021 dates that starts_with("2020") would miss
  inner_join(., kclus, by = "County") %>% 
  mutate(cluster_group = as.factor(cluster_group))

idClusters  = clusters %>% select(c(County, cluster_group)) # creating a look-up table of county and cluster group
colnames(idClusters)  = c('id', 'cluster_group') # renaming the columns
idClusters %<>%  unique() #removing the duplicates due to different dates (we had that to ensure that the clustering was applied correctly)

# Adding Cluster Grouping to a subset of the counties data frame
clusterCounties = counties %>% 
  select(c(id, key_numeric, key_google_mobility, administrative_area_level_2, administrative_area_level_3)) %>% 
  inner_join(., idClusters, by ='id') %>% 
  mutate(cluster_group = paste0('C', cluster_group)) %>% 
  unique()

# saving the results as a RDS File
saveRDS(clusterCounties, '../Data/clusterCounties.rds')
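
The majority rule reported in the NbClust output in this subsection can be reproduced by hand from the printed vote counts. The sketch below only re-tallies those printed counts; the vector `votes` is typed in manually for illustration and is not extracted from the `nc` object.

```r
# Number of clusters proposed by each index, copied from the printed NbClust summary
votes = c(rep(2, 6), rep(3, 3), rep(4, 7), 6, 11, rep(13, 2), 27, 44, 49)

# Tally the proposals and pick the most frequently proposed k
tally = table(votes)
bestK = as.integer(names(tally)[which.max(tally)])

print(tally)
print(bestK)
```

The tally shows that k = 4 wins by a narrow margin over k = 2 (7 votes versus 6), which is worth keeping in mind when interpreting the four clusters.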

3.3 Visualizing the Clustering Results

In this subsection, we provide three plots:

  • A paneled spaghetti plot, highlighting the median scaled time-series profile for each cluster;
  • A panel plot where the first, second and third quartiles of the scaled time-series for each cluster are compared; and
  • An interactive choropleth map to visualize the spatial distribution of the clusters, where the reader can click on a given county to show: (a) county name, (b) assigned cluster, (c) population density, and (d) percentage of residents in poverty.
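As a minimal sketch of the summaries behind the first two plots, the example below computes the per-day quartiles across the members of one hypothetical cluster. The matrix `scaledToy` (five made-up counties by three days) is illustrative only; the analysis itself computes these statistics with `summarise()` on the grouped data frame.

```r
# Hypothetical scaled series: rows are counties in one cluster, columns are days
scaledToy = rbind(
  countyA = c(0.0, 0.5, 1.0),
  countyB = c(0.1, 0.6, 0.9),
  countyC = c(0.2, 0.7, 0.8),
  countyD = c(0.3, 0.8, 0.2),
  countyE = c(0.4, 1.0, 0.1)
)
colnames(scaledToy) = c("day1", "day2", "day3")

# First quartile, median and third quartile across counties, for each day
summaryToy = apply(scaledToy, 2, quantile, probs = c(0.25, 0.5, 0.75), na.rm = TRUE)

print(summaryToy)
```

The median row corresponds to the solid black line in the spaghetti plot, while the first and third quartiles give a sense of how tightly the counties in a cluster follow that profile.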

3.3.1 Spaghetti Plot

spaghettiDF = counties %>% # from the counties
  select(id, date, newCases, key_google_mobility) %>% # selecting minimal columns
  left_join(clusterCounties[, c('id', 'cluster_group')], by = 'id') %>% # to get clusters
  arrange(id, date) %>% # arranged to ensure correct calculations
  mutate(newMA7 = rollmeanr(newCases, k = 7, fill = NA), # 7-day ma of new (adjusted) cases
         maxMA7 = max(newMA7, na.rm = T), # obtaining the max per county to scale data
         scaledNewMA7 = pmax(0, newMA7/maxMA7, na.rm = TRUE) ) %>% 
  ungroup() %>% select(date, cluster_group, scaledNewMA7, key_google_mobility) %>% 
  group_by(date, cluster_group)

spaghettiDF$cluster_group %<>% as.factor() 

# Creating a Named Color Scale
colorPal =  brewer.pal(n= levels(spaghettiDF$cluster_group) %>% length(), 'Set2')
names(colorPal) = levels(spaghettiDF$cluster_group)

# Saving spaghetti plot to a tiff file
tiff(filename = '../Figures/spaghettiPlot.tiff', width = 1366, height =768, pointsize = 16)
spaghettiDF %>%  
  ggplot(aes(x = date, y = scaledNewMA7, color = cluster_group, group = key_google_mobility)) +
  scale_x_date(date_breaks = "1 month", date_labels = "%b") +
  geom_line(size = 0.25, alpha = 0.1) +
  stat_summary(aes(group = 1), 
               fun= median,
               geom = "line",
               size = 1.25, col = 'black') + 
  facet_wrap(~ cluster_group, ncol = 1) +
  theme(legend.position = 'none') + 
  labs(x = 'Month', y = 'Scaled New Cases By Cluster By Day',
       caption = paste0('Solid black line represents the median for each cluster | 
       Based on Data from March 01, 2020 - ', endDatePrintV) )  +
  scale_color_manual(values = colorPal)
invisible( dev.off() ) # to suppress the unwanted output from dev.off

# Displaying the saved TIFF in the Markdown output as a raster image (for quicker compilation of the HTML)
readTIFF("../Figures/spaghettiPlot.tiff") %>% grid.raster()
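
The smoothing and scaling step used for the spaghetti plot can be sketched on toy data. This is a minimal base-R illustration, not the study's pipeline: the `newCases` values are made up, and `stats::filter()` stands in for `zoo::rollmeanr()` under the assumption that a right-aligned window with `NA` fill is wanted.

```r
# Toy daily counts for one hypothetical county (not from the study data)
newCases <- c(0, 2, 4, 6, 8, 10, 12, 14, 16, 18)

# Right-aligned 7-day moving average; the first six entries are NA,
# mirroring rollmeanr(newCases, k = 7, fill = NA)
newMA7 <- as.numeric(stats::filter(newCases, rep(1 / 7, 7), sides = 1))

# Scale by the county's own maximum so that every series peaks at 1
maxMA7 <- max(newMA7, na.rm = TRUE)

# pmax(..., na.rm = TRUE) floors the values at 0 and turns the leading NAs into 0
scaledNewMA7 <- pmax(0, newMA7 / maxMA7, na.rm = TRUE)
print(round(scaledNewMA7, 3))
```

Because each county is divided by its own maximum, the clustering compares the shapes of the curves rather than their magnitudes.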

3.3.2 Summary Plot

# Creating a data frame containing statistical summaries of the time series by cluster_group
summaryDf = spaghettiDF %>% 
  summarise(Median = median(scaledNewMA7, na.rm= TRUE),
            `First Quartile` = quantile(scaledNewMA7, probs = 0.25, na.rm= TRUE),
            `Third Quartile` = quantile(scaledNewMA7, probs = 0.75, na.rm= TRUE)) %>% 
  pivot_longer(cols = c(`First Quartile`, Median, `Third Quartile`),
                        names_to = 'Statistic')

tiff(filename = '../Figures/summaryPlot.tiff', width = 1366, height =768, pointsize = 16)
summaryDf %>% 
  ggplot(aes(x = date, y = value, color = cluster_group, linetype =  Statistic)) +
  scale_x_date(date_breaks = "1 month", date_labels = "%b") +
  geom_line(size = 1.25) +
   scale_linetype_manual(values = c('dotted', 'solid', 'twodash')) +
  facet_wrap(~ cluster_group, ncol = 1) +
  theme(legend.position = 'top') + 
  labs(color = '', x = 'Month', y = 'Quartiles of Scaled New Cases By Cluster By Day',
       caption = paste0('Based on Data from March 01, 2020 - ', endDatePrintV)) +
  scale_color_manual(values = colorPal)
invisible( dev.off() ) # to suppress the unwanted output from dev.off

# Displaying the saved TIFF in the Markdown output as a raster image (for quicker compilation of the HTML)
readTIFF("../Figures/summaryPlot.tiff") %>% grid.raster()
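
The quartile summary above relies on `spaghettiDF` being grouped by `date` and `cluster_group`, so each summary row describes one cluster on one day. A minimal base-R sketch of the same idea for a single date, using made-up values and hypothetical cluster labels:

```r
# Toy data: scaled values for two hypothetical clusters on a single date
toy <- data.frame(
  cluster_group = rep(c("C1", "C2"), each = 5),
  scaledNewMA7  = c(0.0, 0.1, 0.2, 0.3, 0.4,   # counties in C1
                    0.2, 0.4, 0.6, 0.8, 1.0)   # counties in C2
)

# First quartile, median, and third quartile within each cluster
quartiles <- sapply(split(toy$scaledNewMA7, toy$cluster_group),
                    quantile, probs = c(0.25, 0.5, 0.75), na.rm = TRUE)
print(quartiles)  # rows are the quantiles, columns are the clusters
```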

3.3.3 Cluster Map

# Joining the clusterCounties results with the existing county simple features object (cty_sf)
clusterCounties$fips = str_pad(clusterCounties$key_numeric, width = 5, side = 'left', pad = '0')
clusterCounties %<>% ungroup()
cty_sf %<>% left_join(clusterCounties[, c('fips', 'cluster_group')], by = 'fips') # adding cluster_group to cty_sf

# Creating a static visual for the paper
tiff(filename = '../Figures/clusterMap.tiff', width = 1366, height =768, pointsize = 16)
tm_shape(cty_sf) + tm_polygons('cluster_group', title = 'Cluster #', palette = colorPal) +
  tm_credits(paste0('Based on Data from March 01, 2020 - ', endDatePrintV), position=c("right", "bottom"))
invisible( dev.off() ) # to suppress the unwanted output from dev.off


# Creating an interactive visual Using the Leaflet Package
#### Creating a longlat projection (required by leaflet)
leaflet_sf = counties_sf("longlat") %>% filter(!state %in% c('Alaska', 'Hawaii')) # from albersusa
leaflet_sf %<>% geo_join(crossSectionalData, by_sp= 'fips', by_df= 'fips') %>% 
  left_join(clusterCounties[, c('fips', 'cluster_group')], by = 'fips')

#### Setting the Color Scheme
leafletPal =  colorFactor('Set2', domain = leaflet_sf$cluster_group, na.color = "white")

#### The visual
leaflet(height=500) %>% # initializing the leaflet map
  setView(lng = -96, lat = 37.8, zoom = 3.8) %>% # setting the view on Continental US
  addTiles() %>% # adding the default tiles
  addPolygons(data = leaflet_sf, stroke = FALSE, fillColor = ~leafletPal(leaflet_sf$cluster_group), # adding the data
              weight = 2, opacity = 1, color = "white", dashArray = "3", fillOpacity = 0.7, # adding color specs
              popup = paste("County:", leaflet_sf$name, '<br>', 
                            "Cluster #:", leaflet_sf$cluster_group, '<br>',
                            "Population Density:", round(leaflet_sf$popDensity, 1), '<br>')) %>% #pop-up Menu
  addLegend(position = "bottomleft", pal = leafletPal, values =  leaflet_sf$cluster_group, 
            title = "Cluster #", opacity = 1) # legend formatting
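
The joins in this section depend on FIPS codes being five-character strings with leading zeros. The sketch below shows the padding step with base R's `sprintf()` as an alternative to `str_pad()`; the `key_numeric` values are hypothetical examples, and the codes are assumed to be whole numbers.

```r
# Hypothetical numeric county codes; leading zeros are lost when stored as numbers
key_numeric <- c(1001, 6037, 48201)

# Left-pad with zeros to a fixed width of five characters
fips <- sprintf("%05d", as.integer(key_numeric))
print(fips)
```

Without the padding, a county such as 1001 would not match the key "01001" in the map data and would be drawn with the `na.color`.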

4 Explanatory Modeling of Cluster Assignments

In the previous section, we showed that, using only a scaled and smoothed time series of daily cases per county, the counties can be grouped into clusters whose time series have distinct shapes (as measured by Euclidean distance). In this section, we model the factors that are associated with the cluster assignment.

4.1 Descriptive Statistics

multiClassDF = select(clusterCounties, id, cluster_group) %>% 
  left_join(crossSectionalData, by = 'id')  %>% 
  select(-c(administrative_area_level_2, administrative_area_level_3, id, key_numeric))

saveRDS(multiClassDF, '../Data/multiClassDF.rds') # saving the data

skim(multiClassDF) # printing a nice summary table of the data
Data summary
Name                     multiClassDF
Number of rows           3108
Number of columns        11
_______________________
Column type frequency:
  character              3
  factor                 2
  numeric                6
________________________
Group variables          None

Variable type: character

skim_variable        n_missing  complete_rate  min  max  empty  n_unique  whitespace
cluster_group                0           1.00    2    2      0         4           0
key_google_mobility        195           0.94   14   35      0      2913           0
fips                         0           1.00    5    5      0      3108           0

Variable type: factor

skim_variable  n_missing  complete_rate  ordered  n_unique  top_counts
countyType             0              1  FALSE           2  Rur: 1596, Oth: 1512
regions                0              1  FALSE          10  E: 524, F: 503, G: 412, C: 372

Variable type: numeric

skim_variable                  n_missing  complete_rate    mean       sd     p0    p25    p50     p75       p100  hist
popDensity                             1           1.00  369.63  6667.59   0.24  17.28  45.31  119.57  365169.38  ▇▁▁▁▁
percRepVotes                           1           1.00   63.30    15.65   4.09  54.48  66.35   74.92      96.03  ▁▂▅▇▃
GovernmentResponseIndexMedian          1           1.00   47.57     7.86  26.94  44.30  47.22   50.66      70.28  ▁▃▇▂▁
PercentSeniors                        36           0.99   24.85     5.52   5.80  21.30  24.45   27.80      64.20  ▁▇▂▁▁
icuBedsPer10000Seniors                36           0.99    5.49     9.43   0.00   0.00   0.00    8.71     248.16  ▇▁▁▁▁
povertyPercent                         0           1.00   15.18     6.11   2.60  10.90  14.15   18.30      54.00  ▆▇▂▁▁

4.2 Boxplot By Cluster

multiClassDF %>% plot_boxplot(by = 'cluster_group', ncol = 2L, 
               ggtheme = theme_bw(),
               geom_boxplot_args = list('outlier.shape' = 1))

4.3 Explanatory Modeling Using Multinomial Spatial Regression

4.3.1 Data Preparation

multiClassDF$cluster_group %<>% as.factor() # convert to a factor

# impute without using cluster_group, key_google_mobility and fips
multiClassImputed = VIM::kNN(multiClassDF, imp_var = FALSE,
                             dist_var = colnames(multiClassDF)[3:10])
saveRDS(multiClassImputed, '../Data/multiClassImputed.rds') # saving the data

# Creating a df (which will be used for analysis)
df = multiClassImputed # setting df to equal to the multiclass object
df$clustReLeveled =  relevel(df$cluster_group, ref = maxCat(df$cluster_group) ) # setting the ref level
df  = df %>% select(-c(cluster_group, # removed since it is now redundant with the clustReLeveled variable
                       key_google_mobility, # removed since it is an identifier variable
                       icuBedsPer10000Seniors, percRepVotes)) # removed since they did not significantly improve predictions
saveRDS(df, '../Data/df.rds') # saving the data
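
The releveling step sets the reference category of the outcome. `maxCat()` is not a base R function, so the sketch below assumes it returns the most frequent level and reproduces that behavior with `table()`; the cluster labels are toy values.

```r
# Toy cluster assignments (hypothetical)
cluster_group <- factor(c("C3", "C1", "C3", "C2", "C3", "C1"))

# Assumption: maxCat() returns the most frequent level of the factor
mostFrequent <- names(which.max(table(cluster_group)))

# Move that level to the front so it becomes the reference category
clustReLeveled <- relevel(cluster_group, ref = mostFrequent)
print(levels(clustReLeveled))
```

All coefficients in a multinomial model are then read as contrasts against this reference cluster.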

4.3.2 Model Building

multiClassDF.sorted = df[order(df$fips), ]

# File needs to be downloaded from 
# https://www2.census.gov/geo/tiger/TIGER2016/COUNTY/tl_2016_us_county.zip
shape =  readOGR('J:/My Drive/Miami/Code/GitHub/covid/Data', "tl_2016_us_county")
## OGR data source with driver: ESRI Shapefile 
## Source: "J:\My Drive\Miami\Code\GitHub\covid\Data", layer: "tl_2016_us_county"
## with 3233 features
## It has 17 fields
## Integer64 fields read as strings:  ALAND AWATER
shape_sp = st_as_sf(shape)

# Joining both the clustered data and the shape file in one object
joinedMultiClass = geo_join(shape_sp, multiClassDF.sorted, by_sp = 'GEOID', by_df ='fips',how = 'inner')
joinedMultiClass.sorted = joinedMultiClass[order(joinedMultiClass$GEOID), ]

# Building a neighborhood list from the shape data
nbList = poly2nb(joinedMultiClass.sorted) %>% 
  nb2listw(style = 'W', zero.policy = TRUE) # converting it to a listw object

# Identifying the number of neighbors per county
numCounties = length(nbList$neighbours) # number of counties
numNeighbors = rep(0, numCounties) # number of neighbors for each county
for (i in 1:numCounties)  numNeighbors[i] = length(nbList$neighbours[[i]])

# Identifying the neighbors of each county (stored as row indices, not FIPS codes, despite the variable name)
fipsOfNeighbourCounties = c() # initialization
for (i in 1:numCounties) fipsOfNeighbourCounties = c(fipsOfNeighbourCounties, nbList$neighbours[[i]] )

# Pre-nimble parameters
sumNumNeigh = length(fipsOfNeighbourCounties)
m = length(numNeighbors)
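
The two loops above flatten the neighbor list into the `num` and `adj` vectors that the CAR prior expects. A vectorized sketch on a toy neighbor list (four hypothetical areas on a line) shows the intended structure; it assumes every area has at least one neighbor.

```r
# Toy neighbour list for four hypothetical areas arranged as 1 - 2 - 3 - 4
neighbours <- list(2L, c(1L, 3L), c(2L, 4L), 3L)

num <- lengths(neighbours)        # number of neighbours of each area
adj <- unlist(neighbours)         # neighbour indices, concatenated area by area
weights <- rep(1, length(adj))    # unit weight for every adjacency

print(num)
print(adj)
```

In an `spdep` neighbor object, an area without neighbors is stored as a single `0L`, so `spdep::card()` is the safer way to count neighbors if such areas can occur.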

# Preparing the outcome
number_of_clusters = unique(multiClassDF.sorted$clustReLeveled) %>% length()
cluster_group_matrix = matrix( 0 , nrow = m, ncol = number_of_clusters)
cluster_group_nu = str_remove(multiClassDF.sorted$clustReLeveled, 'C') %>% as.numeric()

for (i in 1:m)
{
  cluster_group_matrix[ i , cluster_group_nu[i] ] = 1
}
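
The loop above builds an indicator (one-hot) matrix with one row per county and a single 1 in the column of its cluster. The same result can be obtained without a loop through matrix indexing, sketched here on toy cluster numbers (the names `cluster_nu`, `K`, and `onehot` are illustrative only):

```r
# Toy cluster numbers for five hypothetical counties and K = 4 clusters
cluster_nu <- c(1, 3, 2, 4, 1)
K <- 4

onehot <- matrix(0, nrow = length(cluster_nu), ncol = K)

# A two-column index matrix addresses one (row, column) cell per county
onehot[cbind(seq_along(cluster_nu), cluster_nu)] <- 1
print(onehot)
```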

## Storing each predictor in a vector
countyType = multiClassDF.sorted$countyType
Underserved = fastDummies::dummy_cols(countyType)
Underserved.rural = Underserved$`.data_Rural/Underserved` %>% as.matrix() %>% as.vector()
popDensity = multiClassDF.sorted$popDensity %>% as.matrix() %>% as.vector()
GovernmentResponseIndexMedian = multiClassDF.sorted$GovernmentResponseIndexMedian %>% as.matrix() %>% as.vector()
PercentSeniors = multiClassDF.sorted$PercentSeniors %>% as.matrix() %>% as.vector()
povertyPercent = multiClassDF.sorted$povertyPercent %>% as.matrix() %>% as.vector()
regions = multiClassDF.sorted$regions

results = dummy_cols(regions)
A = results$.data_A
B = results$.data_B
C = results$.data_C
D = results$.data_D
E = results$.data_E
FF = results$.data_F
G = results$.data_G
H = results$.data_H
I = results$.data_I



# Nimble Code spatial
code = nimbleCode(
  {
    for (i in 1:m)
    {
      cluster_group_matrix[i,1:number_of_clusters] ~ dmulti( prob = p[i,1:number_of_clusters] ,1) 
      phi[i,1] <- 1
      p[i,1] <- 1/sum(phi[i,1:number_of_clusters])
      for (k in 2:number_of_clusters)
      {
        log(phi[i,k]) <- b0[k] + b1[k]*Underserved.rural[i] + 
          b2[k]*popDensity[i] +
          b3[k]*PercentSeniors[i] + 
          b4[k]*GovernmentResponseIndexMedian[i] + 
          b5[k]*povertyPercent[i] + b6[k]*A[i] + b7[k]*B[i] + 
          b8[k]*C[i] + b9[k]*D[i] + b10[k]*E[i] + b11[k]*F[i] + b12[k]*G[i] + b13[k]*H[i] + b14[k]*I[i] + u[i]
        p[i,k] <- phi[i,k]/sum(phi[i,1:number_of_clusters])
      }
    }    
    for (k in 2:number_of_clusters) 
    {
      b0[k] ~ dnorm(0, 0.00001); b1[k] ~ dnorm(0, 0.00001); b2[k] ~ dnorm(0, 0.00001)
      b3[k] ~ dnorm(0, 0.00001); b4[k] ~ dnorm(0, 0.00001); b5[k] ~ dnorm(0, 0.00001)
      b6[k] ~ dnorm(0, 0.00001); b7[k] ~ dnorm(0, 0.00001); b8[k] ~ dnorm(0, 0.00001)
      b9[k] ~ dnorm(0, 0.00001); b10[k] ~ dnorm(0, 0.00001); b11[k] ~ dnorm(0, 0.00001)
      b12[k] ~ dnorm(0, 0.00001); b13[k] ~ dnorm(0, 0.00001); b14[k] ~ dnorm(0, 0.00001)
    }
    u[1:m] ~ dcar_normal(adj[1:sumNumNeigh], weights[1:sumNumNeigh], 
                         num[1:m],tauu)
    for (j in 1:sumNumNeigh)
    {weights[j] <- 1}
    tauu ~ dgamma(1,0.0001)
  }
)  
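
The model code above is a baseline-category logit: `phi` is fixed at 1 for the reference cluster, and each remaining cluster gets `exp()` of its linear predictor, which includes the shared spatial effect `u[i]`. A minimal sketch of how the probabilities follow for one county, with made-up linear predictor values:

```r
# Hypothetical linear predictors for one county and clusters 2, 3, and 4
eta <- c(0.5, -1.0, 0.2)

phi <- c(1, exp(eta))   # phi = 1 for the reference cluster
p <- phi / sum(phi)     # probabilities over the four clusters

print(round(p, 4))
print(log(p[2] / p[1])) # log-odds versus the reference recover eta[1]
```

In NIMBLE's BUGS-style parameterization, the second argument of `dnorm` is a precision by default, so `dnorm(0, 0.00001)` corresponds to a standard deviation of roughly 316, a deliberately vague prior.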

constants = list(num=numNeighbors, adj=fipsOfNeighbourCounties,
                  sumNumNeigh = length(fipsOfNeighbourCounties), 
                  m=m,number_of_clusters=number_of_clusters)

data = list(cluster_group_matrix = cluster_group_matrix, 
             Underserved.rural = Underserved.rural,
             popDensity = popDensity,
             PercentSeniors = PercentSeniors,
             GovernmentResponseIndexMedian = GovernmentResponseIndexMedian,
             povertyPercent = povertyPercent,
             A=A, B=B, C=C, D=D, E=E, F=FF,
             G=G, H=H, I=I)

inits = list(b0=rep(0, number_of_clusters), u=rep(0,m),tauu=1, b1=rep(0, number_of_clusters), 
             b2=rep(0, number_of_clusters), b3=rep(0, number_of_clusters), b4=rep(0, number_of_clusters),
             b5=rep(0, number_of_clusters), b6=rep(0, number_of_clusters), b7=rep(0, number_of_clusters), 
             b8=rep(0, number_of_clusters), b9=rep(0, number_of_clusters), b10=rep(0, number_of_clusters),
             b11=rep(0, number_of_clusters), b12=rep(0, number_of_clusters), b13=rep(0, number_of_clusters),
             b14=rep(0, number_of_clusters) )

Rmodel = nimbleModel(code=code, constants=constants, data=data, inits=inits)

compile.Rmodel = compileNimble( Rmodel )

monitors = c('b0','b1','b2','b3','b4','b5','b6','b7','b8',
              'b9','b10','b11','b12','b13','b14','p','tauu')

Rmodel.Conf = configureMCMC( Rmodel , monitors=monitors, thin = 100)
## ===== Monitors =====
## thin = 100: b0, b1, b2, b3, b4, b5, b6, b7, b8, b9, b10, b11, b12, b13, b14, p, tauu
## ===== Samplers =====
## RW sampler (46)
##   - b0[]  (3 elements)
##   - b1[]  (3 elements)
##   - b2[]  (3 elements)
##   - b3[]  (3 elements)
##   - b4[]  (3 elements)
##   - b5[]  (3 elements)
##   - b6[]  (3 elements)
##   - b7[]  (3 elements)
##   - b8[]  (3 elements)
##   - b9[]  (3 elements)
##   - b10[]  (3 elements)
##   - b11[]  (3 elements)
##   - b12[]  (3 elements)
##   - b13[]  (3 elements)
##   - b14[]  (3 elements)
##   - tauu
## CAR_normal sampler (1)
##   - u[1:3108]
Rmodel.MCMC = buildMCMC( Rmodel.Conf )
compile.Rmodel.MCMC = compileNimble( Rmodel.MCMC )

niter = 300000
nburn = 150000

start.time = proc.time()
spatial.base6 = runMCMC( compile.Rmodel.MCMC, niter = niter, nburnin = nburn,
                         inits = inits, nchains = 1, samplesAsCodaMCMC = TRUE )
## |-------------|-------------|-------------|-------------|
## |-------------------------------------------------------|
stop.time = proc.time()
time.elapsed = stop.time - start.time
print( time.elapsed )
##     user   system  elapsed 
## 80161.21     2.61 80199.27

4.4 Resulting Model

Betacoe = spatial.base6[, (1:60)] # based on current number of predictors
saveRDS(Betacoe, '../Data/betacoe.rds') # saving the data

# Computing the coefficients' values
coeffTable = rbind(colMeans(Betacoe), colSds(Betacoe))
rownames(coeffTable) = c('means', 'stdevs')

# Formatting the output
coeffTable %<>% as.data.frame() %>% 
  select(paste0(paste0("b", rep(seq(0, 14), 4)), '[', 
                c(rep(1,15), rep(2,15), rep(3,15), rep(4,15)),
                ']' )) %>% # reordering cols by name
  select_if(~ !any(is.na(.)))  # dropping NA cols (corresponding to Cluster 1)

tCoeffTable = t(coeffTable)

tCoeffTable = cbind(tCoeffTable[1:15,], tCoeffTable[16:30,], tCoeffTable[31:45,]) %>% data.frame()
row.names(tCoeffTable) = c('constant', 'rural', 'popDensity', 'percSeniors',
                           'govResponse', 'percPoverty', 'regionA', 'regionB', 'regionC',
                           'regionD', 'regionE', 'regionF', 'regionG', 'regionH',
                           'regionI')
colnames(tCoeffTable) = c('C2_coef_mean', 'C2_coef_sd',
                          'C3_coef_mean', 'C3_coef_sd',
                          'C4_coef_mean', 'C4_coef_sd')

tCoeffTable %>% round(digits = 3) %>% datatable(
          extensions = c('FixedColumns', 'Buttons'), options = list(
            pageLength = 15,
            dom = 'Bfrtip',
            scrollX = TRUE,
            buttons = c('copy', 'csv', 'excel', 'pdf'),
            fixedColumns = list(leftColumns = 1)))
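
The coefficient table above reports, for each parameter, the mean and standard deviation of its posterior draws. A small sketch with a made-up matrix of draws (rows are MCMC iterations, columns are parameters) shows the computation, using base R's `apply()` in place of `matrixStats::colSds()`:

```r
# Toy posterior draws: three iterations for two hypothetical coefficients
draws <- cbind(`b0[2]` = c(0.9, 1.0, 1.1),
               `b1[2]` = c(-0.4, -0.5, -0.6))

coeffTable <- rbind(means  = colMeans(draws),
                    stdevs = apply(draws, 2, sd))
print(coeffTable)
```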

4.5 Model’s Performance

samples_p = spatial.base6[, -(1:60)]
samples_p_mean = colMeans(exp(samples_p[, 1:(number_of_clusters*m)]))
C1 = samples_p_mean[1:3108]
C2 = samples_p_mean[3109:6216]
C3 = samples_p_mean[6217:9324]
C4 = samples_p_mean[9325:12432]
pred.0 = cbind(C1, C2, C3, C4) 
pred = rep(NA, m)
for (i in 1:m) {
  vec = pred.0[i, 1:number_of_clusters]
  pred[i] = which.max(vec)
}

predicted.spatial = cbind(pred, multiClassDF.sorted[, 'fips']) 
colnames(predicted.spatial) = c('pred', 'fips')

predSpatialFinal = merge(predicted.spatial, multiClassDF.sorted[, c('clustReLeveled', 'fips')],
                         by = 'fips')

predSpatialFinal$clustReLeveled %<>%  str_remove('C')
saveRDS(predSpatialFinal, '../Data/predSpatialFinal.rds') # saving the data


# Computing the Confusion Metrics and By Class Metrics
confMatrix = confusionMatrix(as.factor(predSpatialFinal$pred), 
                             as.factor(predSpatialFinal$clustReLeveled))
saveRDS(confMatrix, '../Data/confMatrixSpatialModel.rds') # saving the data

# Printing the Resulting tables nicely
pander(confMatrix$table)
Prediction \ Reference     1     2     3     4
1                       1017     0     0     0
2                          0   355    96   118
3                          0    53   267   150
4                          0   154   168   730
pander(confMatrix$byClass)
Table continues below
          Sensitivity  Specificity  Pos Pred Value  Neg Pred Value
Class: 1  1            1            1               1
Class: 2  0.6317       0.9159       0.6239          0.9185
Class: 3  0.5028       0.9212       0.5681          0.8999
Class: 4  0.7315       0.8474       0.6939          0.8696

Table continues below
          Precision  Recall  F1      Prevalence  Detection Rate
Class: 1  1          1       1       0.3272      0.3272
Class: 2  0.6239     0.6317  0.6278  0.1808      0.1142
Class: 3  0.5681     0.5028  0.5335  0.1708      0.08591
Class: 4  0.6939     0.7315  0.7122  0.3211      0.2349

          Detection Prevalence  Balanced Accuracy
Class: 1  0.3272                1
Class: 2  0.1831                0.7738
Class: 3  0.1512                0.712
Class: 4  0.3385                0.7894
pander(confMatrix$overall)
Table continues below
Accuracy  Kappa   AccuracyLower  AccuracyUpper  AccuracyNull
0.7622    0.6722  0.7469         0.7771         0.3272

AccuracyPValue  McnemarPValue
0               NA
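
The headline metrics can be recomputed directly from the confusion matrix printed in this section. The sketch below re-enters the reported counts by hand, with rows as predictions and columns as reference classes (the layout used by `caret::confusionMatrix`).

```r
# Counts copied from the confusion matrix reported above
confTable <- matrix(c(1017,   0,   0,   0,
                         0, 355,  96, 118,
                         0,  53, 267, 150,
                         0, 154, 168, 730),
                    nrow = 4, byrow = TRUE,
                    dimnames = list(Prediction = as.character(1:4),
                                    Reference  = as.character(1:4)))

accuracy    <- sum(diag(confTable)) / sum(confTable)   # share of correct predictions
sensitivity <- diag(confTable) / colSums(confTable)    # recall per reference class
precision   <- diag(confTable) / rowSums(confTable)    # positive predictive value

print(round(accuracy, 4))
print(round(sensitivity, 4))
print(round(precision, 4))
```

These values reproduce the Accuracy, Sensitivity, and Precision columns shown above.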

4.6 Visualizing the Model’s Outcomes

###Visualizing the Model’s Predictions
predSpatialFinal = readRDS('../Data/predSpatialFinal.rds') # loading the data
predSpatialFinal$match = ifelse(predSpatialFinal$pred == predSpatialFinal$clustReLeveled, "Yes", "No") %>%
  as.factor()

mapResults = predSpatialFinal
mapResults$fips %<>% as.factor()

# Retrieving the U.S. county composite map as a simple feature (stored as cty_sf so the counties_sf() function is not masked)
cty_sf = counties_sf("aeqd") %>% filter(!state %in% c('Alaska', 'Hawaii')) # from albersusa
cty_sf %<>% geo_join(mapResults, by_sp= 'fips', by_df= 'fips')


# Creating a static visual for use in the paper
tiff(filename = '../Figures/clusterMatchMapSpatial.tiff', width = 1366, height =768, pointsize = 16)
tm_shape(cty_sf) + tm_polygons('match', title = 'Cluster Match', style = 'cont', palette = "div") +
  tm_layout(aes.palette = list(div = list("Yes" = "#CAB2D6", "No" = "#6A3D9A"))) +
  tm_credits(paste0('Based on Data from March 01, 2020 - ', endDatePrintV), position=c("right", "bottom"))
invisible( dev.off() ) # to suppress the unwanted output from dev.off


# Creating a longlat projection (required by leaflet)
leaflet_sf = counties_sf("longlat") %>% filter(!state %in% c('Alaska', 'Hawaii')) # from albersusa
leaflet_sf %<>% geo_join(mapResults, by_sp= 'fips', by_df= 'fips')

leafletPal = colorFactor(palette = c("#CAB2D6", "#6A3D9A"), levels = c('Yes', 'No'), na.color = "white")

leaflet(height = 500) %>% 
  setView(lng = -96, lat = 37.8, zoom = 3.8) %>%
  addTiles() %>%
  addPolygons(data = leaflet_sf, stroke = FALSE, fillColor = ~leafletPal(leaflet_sf$match),
              weight = 2, opacity = 1, color = "white", dashArray = "3", fillOpacity = 0.7,
              popup = paste("County:", leaflet_sf$name, '<br>', 
                            "Cluster #:", leaflet_sf$clustReLeveled, '<br>',
                            "Cluster Predicted:", leaflet_sf$pred, '<br>',
                            "Cluster Match:", leaflet_sf$match, '<br>')) %>% #pop-up Menu
  addLegend(position = "bottomleft", pal = leafletPal, values =  leaflet_sf$match, 
            title = "Cluster Match", opacity = 1)

5 References

Aghabozorgi, S., Shirkhorshidi, A. S., & Wah, T. Y. (2015). Time-series clustering: A decade review. Information Systems, 53, 16–38.

Charrad, M., Ghazzali, N., Boiteau, V., & Niknafs, A. (2014). NbClust: An R package for determining the relevant number of clusters in a data set. Journal of Statistical Software, 61(6), 1–36. https://doi.org/10.18637/jss.v061.i06

Guidotti, E., & Ardia, D. (2020). COVID-19 data hub. Journal of Open Source Software, 5(51), 2376. https://doi.org/10.21105/joss.02376

MIT Election Data and Science Lab. (2018). County Presidential Election Returns 2000-2016 (Version V6) [Data set]. Harvard Dataverse. https://doi.org/10.7910/DVN/VOQCHQ

Sardá-Espinosa, A. (2017). Comparing time-series clustering algorithms in R using the dtwclust package. R Package Vignette, 12, 41.


6 Appendices

6.1 Appendix A: Packages Used

In this appendix, we print all the R packages used in our analysis, along with their versions, to assist with reproducing our results.

pander(sessionInfo(), compact = TRUE) # printing the session information

R version 4.0.3 (2020-10-10)

Platform: x86_64-w64-mingw32/x64 (64-bit)

locale: LC_COLLATE=English_United States.1252, LC_CTYPE=English_United States.1252, LC_MONETARY=English_United States.1252, LC_NUMERIC=C and LC_TIME=English_United States.1252

attached base packages: grid, stats, graphics, grDevices, utils, datasets, methods and base

other attached packages: conflicted(v.1.0.4), caret(v.6.0-86), lattice(v.0.20-41), nnet(v.7.3-14), matrixStats(v.0.57.0), fastDummies(v.1.6.3), nimble(v.0.10.1), spdep(v.1.1-5), sf(v.0.9-6), spData(v.0.3.8), rgdal(v.1.5-18), sp(v.1.4-4), VIM(v.6.0.0), colorspace(v.2.0-0), NbClust(v.3.0), expsmooth(v.2.3), fma(v.2.4), forecast(v.8.13), fpp2(v.2.4), zoo(v.1.8-8), tmap(v.3.2), leaflet(v.2.0.3), tigris(v.1.0), plotly(v.4.9.2.2), tiff(v.0.1-6), DataExplorer(v.0.8.2), RColorBrewer(v.1.1-2), scales(v.1.1.1), knitr(v.1.30), stargazer(v.5.2.2), pander(v.0.6.3), DT(v.0.16), readxl(v.1.3.1), rvest(v.0.3.6), xml2(v.1.3.2), COVID19(v.2.3.1), skimr(v.2.1.2), dataPreparation(v.1.0.1), progress(v.1.2.2), Matrix(v.1.2-18), lubridate(v.1.7.9.2), janitor(v.2.0.1), magrittr(v.2.0.1), forcats(v.0.5.0), stringr(v.1.4.0), dplyr(v.1.0.2), purrr(v.0.3.4), readr(v.1.4.0), tidyr(v.1.1.2), tibble(v.3.0.4), tidyverse(v.1.3.0), albersusa(v.0.4.1), devtools(v.2.3.2), usethis(v.2.0.0), pacman(v.0.5.1), lemon(v.0.4.5) and ggplot2(v.3.3.3)

loaded via a namespace (and not attached): tidyselect(v.1.1.0), htmlwidgets(v.1.5.3), ranger(v.0.12.1), maptools(v.1.0-2), pROC(v.1.16.2), munsell(v.0.5.0), codetools(v.0.2-16), units(v.0.6-7), withr(v.2.3.0), uuid(v.0.1-4), rstudioapi(v.0.13), stats4(v.4.0.3), robustbase(v.0.93-6), vcd(v.1.4-8), TTR(v.0.24.2), repr(v.1.1.0), farver(v.2.0.3), rprojroot(v.2.0.2), LearnBayes(v.2.15.1), coda(v.0.19-4), vctrs(v.0.3.6), generics(v.0.1.0), ipred(v.0.9-9), xfun(v.0.19), R6(v.2.5.0), assertthat(v.0.2.1), networkD3(v.0.4), rgeos(v.0.5-5), gtable(v.0.3.0), lwgeom(v.0.2-5), processx(v.3.4.5), timeDate(v.3043.102), rlang(v.0.4.9), splines(v.4.0.3), lazyeval(v.0.2.2), ModelMetrics(v.1.2.2.2), dichromat(v.2.0-0), broom(v.0.7.3), reshape2(v.1.4.4), yaml(v.2.2.1), abind(v.1.4-5), modelr(v.0.1.8), crosstalk(v.1.1.0.1), backports(v.1.2.1), quantmod(v.0.4.18), lava(v.1.6.8.1), tools(v.4.0.3), ellipsis(v.0.3.1), raster(v.3.4-5), sessioninfo(v.1.1.1), Rcpp(v.1.0.5), plyr(v.1.8.6), base64enc(v.0.1-3), classInt(v.0.4-3), ps(v.1.5.0), prettyunits(v.1.1.1), rpart(v.4.1-15), deldir(v.0.2-3), fracdiff(v.1.5-1), tmaptools(v.3.1), haven(v.2.3.1), fs(v.1.5.0), leafem(v.0.1.3), data.table(v.1.13.6), openxlsx(v.4.2.3), gmodels(v.2.18.1), lmtest(v.0.9-38), reprex(v.0.3.0), pkgload(v.1.1.0), hms(v.0.5.3), evaluate(v.0.14), XML(v.3.99-0.5), rio(v.0.5.16), gridExtra(v.2.3), testthat(v.3.0.1), compiler(v.4.0.3), KernSmooth(v.2.23-17), crayon(v.1.3.4), htmltools(v.0.5.0), expm(v.0.999-5), DBI(v.1.1.0), dbplyr(v.2.0.0), MASS(v.7.3-53), rappdirs(v.0.3.1), boot(v.1.3-25), car(v.3.0-10), cli(v.2.2.0), gdata(v.2.18.0), quadprog(v.1.5-8), gower(v.0.2.2), parallel(v.4.0.3), igraph(v.1.2.6), pkgconfig(v.2.0.3), foreign(v.0.8-80), laeken(v.0.5.1), recipes(v.0.1.15), foreach(v.1.5.1), prodlim(v.2019.11.13), snakecase(v.0.11.0), callr(v.3.5.1), digest(v.0.6.27), rmarkdown(v.2.6), cellranger(v.1.1.0), leafsync(v.0.1.0), curl(v.4.3), gtools(v.3.8.2), urca(v.1.3-0), lifecycle(v.0.2.0), nlme(v.3.1-149), 
jsonlite(v.1.7.2), tseries(v.0.10-48), carData(v.3.0-4), desc(v.1.2.0), viridisLite(v.0.3.0), fansi(v.0.4.1), pillar(v.1.4.7), survival(v.3.2-7), httr(v.1.4.2), DEoptimR(v.1.0-8), pkgbuild(v.1.2.0), glue(v.1.4.2), xts(v.0.12.1), remotes(v.2.2.0), zip(v.2.1.1), iterators(v.1.0.13), png(v.0.1-7), leaflet.providers(v.1.9.0), class(v.7.3-17), stringi(v.1.5.3), stars(v.0.4-3), memoise(v.1.1.0) and e1071(v.1.7-4)

6.2 Appendix B: Baseline Multinomial Model

6.2.1 Model Building

df = readRDS('../Data/df.rds') # loading the data
df %<>% select(-fips) # removed since it was only used in the spatial model 
finalModel =  quiet(multinom(clustReLeveled ~ ., data = df)) # building the multinomial model
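
For readers unfamiliar with `nnet::multinom()`, the sketch below fits the same kind of baseline multinomial model on R's built-in `iris` data rather than on the study data. `quiet()` is not a base R function, so `trace = FALSE` is used here as a standard way to suppress the iteration log.

```r
library(nnet)  # recommended package that ships with standard R installations

# Multinomial logit with the first factor level as the reference category
fit <- multinom(Species ~ ., data = iris, trace = FALSE)

probs <- fitted(fit)        # one row per observation, one column per class
pred  <- predict(fit, iris) # class with the largest fitted probability

print(head(round(probs, 3)))
print(table(Predicted = pred, Actual = iris$Species))
```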

6.2.2 Resulting Model

# tabulating the model results as an HTML table, which we print below
stargazer(finalModel, type = 'html', p.auto = FALSE, out="../Data/multi.html", single.row = FALSE)
Dependent variable:
C2 C3 C4
(1) (2) (3)
countyTypeRural/Underserved -0.869*** 0.404*** -0.441***
(0.0004) (0.0004) (0.0005)
popDensity 0.0003** 0.0001 0.0001
(0.0001) (0.0001) (0.0001)
GovernmentResponseIndexMedian -0.028*** 0.039*** 0.051***
(0.007) (0.006) (0.005)
PercentSeniors -0.042*** 0.047*** 0.010
(0.012) (0.009) (0.009)
regionsB 72.713*** -0.029*** 0.426***
(0.0001) (0.0001) (0.0001)
regionsC 75.563*** 1.348*** 0.718***
(0.0002) (0.0001) (0.0001)
regionsD 72.547*** -1.807*** -0.717***
(0.0002) (0.0001) (0.0002)
regionsE -31.816 -3.853*** -2.897***
(0.0001) (0.001)
regionsF 71.946*** 0.056*** -1.272***
(0.0001) (0.0002) (0.0002)
regionsG 66.742*** -2.753*** -3.577***
(0.00002) (0.0003) (0.0002)
regionsH 66.344*** -2.093*** -3.236***
(0.00002) (0.0002) (0.0001)
regionsI 73.312*** 0.334*** -0.307***
(0.0001) (0.00005) (0.0001)
regionsJ 70.817*** -1.307*** -1.892***
(0.00004) (0.00004) (0.00003)
povertyPercent 0.065*** 0.051*** -0.013
(0.011) (0.010) (0.010)
Constant -70.024*** -3.016*** -0.269***
(0.0001) (0.0001) (0.00004)
Akaike Inf. Crit. 5,826.064 5,826.064 5,826.064
Note: *p<0.1; **p<0.05; ***p<0.01

6.2.3 Model’s Performance

# examining how well the model performed on our entire dataset
# Recall that we are fitting an explanatory model, and not a predictive model
predictedClass = predict(finalModel, df)
saveRDS(finalModel, '../Data/finalModel.rds') # saving the data

# Computing the Confusion Metrics and By Class Metrics
confMatrix = confusionMatrix(predictedClass, df$clustReLeveled)
saveRDS(confMatrix, '../Data/confMatrix.rds') # saving the data

# Printing the resulting tables nicely
pander(confMatrix$table)
Prediction \ Reference   C1   C2   C3   C4
C1                      865    7  125  260
C2                       24  366   94  132
C3                       57   50  177   98
C4                       71  139  135  508
pander(confMatrix$byClass)
Table continues below
           Sensitivity  Specificity  Pos Pred Value  Neg Pred Value
Class: C1  0.8505       0.8125       0.6881          0.9179
Class: C2  0.6512       0.9018       0.5942          0.9213
Class: C3  0.3333       0.9205       0.4634          0.8701
Class: C4  0.509        0.8365       0.5955          0.7827

Table continues below
           Precision  Recall  F1      Prevalence  Detection Rate
Class: C1  0.6881     0.8505  0.7608  0.3272      0.2783
Class: C2  0.5942     0.6512  0.6214  0.1808      0.1178
Class: C3  0.4634     0.3333  0.3877  0.1708      0.05695
Class: C4  0.5955     0.509   0.5489  0.3211      0.1634

           Detection Prevalence  Balanced Accuracy
Class: C1  0.4044                0.8315
Class: C2  0.1982                0.7765
Class: C3  0.1229                0.6269
Class: C4  0.2745                0.6728
pander(confMatrix$overall)
Table continues below
Accuracy  Kappa   AccuracyLower  AccuracyUpper  AccuracyNull
0.6165    0.4693  0.5991         0.6336         0.3272

AccuracyPValue  McnemarPValue
1.667e-238      2.077e-32

6.2.4 Visualizing the Model’s Predictions

predictedProbs = fitted(finalModel) # computing predicted probabilities for each of the cluster outcome levels
mapResults = cbind(multiClassDF, predictedProbs) # col binding predProbs for Each Cluster with multiClassDF

# Finding indices to subset the data
numberOfClusters = unique(mapResults$cluster_group) %>% as.character() %>% length() 
startCol = ncol(mapResults) - numberOfClusters + 1
endCol = ncol(mapResults)

# Finding whether the predicted and actual clusters matched for each county
mapResults$LargestProbCluster = colnames(mapResults[, startCol:endCol])[apply(mapResults[, startCol:endCol], 1, which.max)] 
mapResults$match = ifelse(mapResults$cluster_group == mapResults$LargestProbCluster, 'Yes', 'No') %>% as.factor()

# Retrieving the U.S. county composite map as a simplefeature (since it has been overwritten)
cty_sf = counties_sf("aeqd") %>% filter(!state %in% c('Alaska', 'Hawaii')) # from albersusa
cty_sf %<>% geo_join(mapResults, by_sp= 'fips', by_df= 'fips')

# Creating a static visual for use in the paper
tiff(filename = '../Figures/clusterMatchMap.tiff', width = 1366, height =768, pointsize = 16)
tm_shape(cty_sf) + tm_polygons('match', title = 'Cluster Match', style = 'cont', palette = "div") +
  tm_layout(aes.palette = list(div = list("Yes" = "#CAB2D6", "No" = "#6A3D9A"))) +
  tm_credits(paste0('Based on Data from March 01, 2020 - ', endDatePrintV), position=c("right", "bottom"))
invisible( dev.off() ) # to suppress the unwanted output from dev.off


# Creating an interactive visual Using the Leaflet Package
#### Creating a longlat projection (required by leaflet)
leaflet_sf = counties_sf("longlat") %>% filter(!state %in% c('Alaska', 'Hawaii')) # from albersusa
leaflet_sf %<>% geo_join(mapResults, by_sp= 'fips', by_df= 'fips')

#### Setting the Color Scheme
leafletPal =  colorFactor(palette = c("#CAB2D6", "#6A3D9A"), levels = c('Yes', 'No'), na.color = "white")

#### The visual
leaflet(height=500) %>% # initializing the leaflet map
  setView(lng = -96, lat = 37.8, zoom = 3.8) %>% # setting the view on Continental US
  addTiles() %>% # adding the default tiles
  addPolygons(data = leaflet_sf, stroke = FALSE, fillColor = ~leafletPal(leaflet_sf$match), # adding the data
              weight = 2, opacity = 1, color = "white", dashArray = "3", fillOpacity = 0.7, # adding color specs
              popup = paste("County:", leaflet_sf$name, '<br>', 
                            "Cluster #:", leaflet_sf$cluster_group, '<br>',
                            "Cluster Predicted:", leaflet_sf$LargestProbCluster, '<br>',
                            "Cluster Match:", leaflet_sf$match, '<br>')) %>% #pop-up Menu
  addLegend(position = "bottomleft", pal = leafletPal, values =  leaflet_sf$match, 
            title = "Cluster Match", opacity = 1) # legend formatting
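
The match flag mapped in this subsection compares each county's assigned cluster with the cluster that has the largest fitted probability. A minimal sketch with a made-up probability matrix for three hypothetical counties (the names `probs`, `actual`, and `matchFlag` are illustrative only):

```r
# Toy fitted probabilities: rows are counties, columns are clusters
probs <- matrix(c(0.7, 0.1, 0.1, 0.1,
                  0.2, 0.5, 0.2, 0.1,
                  0.1, 0.2, 0.3, 0.4),
                nrow = 3, byrow = TRUE,
                dimnames = list(NULL, c("C1", "C2", "C3", "C4")))
actual <- c("C1", "C3", "C4")   # hypothetical assigned clusters

# Column name of the largest probability in each row
largestProbCluster <- colnames(probs)[apply(probs, 1, which.max)]

# "Yes" when the most likely cluster equals the assigned cluster
matchFlag <- factor(ifelse(actual == largestProbCluster, "Yes", "No"))
print(data.frame(actual, largestProbCluster, matchFlag))
```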
---
title: "A Retrospective Two-Stage Analysis of COVID Cases by County"
author:
  - name: "Fadel M. Megahed ^[Email: fmegahed@miamioh.edu | Phone: +1-513-529-4185 | Website: <a href=\"https://miamioh.edu/fsb/directory/?up=/directory/megahefm\">Miami University Official</a>]"
    affiliation: Farmer School of Business, Miami University
  - name: "Allison Jones-Farmer ^[Email: farmerl2@miamioh.edu | Phone: +1-513-529-4823 | Website: <a href=\"https://miamioh.edu/fsb/directory/?up=/directory/farmerl2\">Miami University Official</a>]"
    affiliation: Farmer School of Business, Miami University
  - name: "Longwen Zhao ^[Email: longwen.zhao@slu.edu | Website: <a href=\"https://www.linkedin.com/in/longwen-zhao-06916486\">LinkedIn Site</a>]"
    affiliation: College of  Public Health and Social Justice, Saint Louis University
  - name: "Steve Rigdon ^[Email: steve.rigdon@slu.edu | Website: <a href=\"https://www.slu.edu/public-health-social-justice/faculty/rigdon-steven.php\">Saint Louis University Official</a>]"
    affiliation: College of  Public Health and Social Justice, Saint Louis University
bibliography: covidRefs.bib
csl: apa.csl
date: "`r format(Sys.time(), '%B %d, %Y')`"
output: 
  html_document:
    toc: TRUE
    toc_float: TRUE
    number_sections: TRUE
    theme: simplex
    paged_df: TRUE
    code_folding: show
    code_download: TRUE
  includes:
    in_header: structure.tex
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE,
                      warning = FALSE,
                      message = FALSE,
                      cache = TRUE,
                      progress = FALSE, 
                      verbose = FALSE,
                      dpi = 600,
                      dev = c('png', 'postscript'),
                      out.width = '100%')
options(qwraps2_markup = "markdown")

library(ggplot2); theme_set(theme_bw(base_size = 16, base_family = "Arial")) # setting the preferred ggplot theme to bw
library(lemon); knit_print.tbl = lemon_print
```

# R Setup and Required Packages
In this project, the open-source R programming language is used to model the progression of the COVID-19 pandemic across U.S. counties. R is maintained by an international team of developers who make the language available at [The Comprehensive R Archive Network](https://cran.r-project.org/). Readers interested in reusing our code and reproducing our results should have R installed locally on their machines. R can be installed on a number of different operating systems (see [Windows](https://cran.r-project.org/bin/windows/), [Mac](https://cran.r-project.org/bin/macosx/), and [Linux](https://cran.r-project.org/bin/linux/) for the installation instructions for these systems). We also recommend using the RStudio interface for R. The reader can [download RStudio](http://www.rstudio.com/ide) for free by following the instructions at the link. For non-R users, we recommend [Hands-on Programming with R](https://rstudio-education.github.io/hopr/packages.html) for a brief overview of the software's functionality. Hereafter, we assume that the reader has an introductory understanding of the R programming language.

In the code chunk below, we load the packages used to support our analysis. Note that the code in this chunk, and in any other code chunk, can be hidden by clicking on its 'Hide' button to facilitate navigation. **The reader can hide all code and/or download the Rmd file associated with this document by clicking on the Code button in the top right corner of this document.** Our input and output files can also be accessed or downloaded from [fmegahed/covid19](https://github.com/fmegahed/covid19). 

```{r packages, cache=FALSE}
if (!require(pacman)) install.packages("pacman") # install the pacman package if it is not already available
if (!require(devtools)) install.packages("devtools") # install the devtools package if it is not already available

# install the GitHub-only packages if they are not found locally on the machine
if (!require(albersusa)) devtools::install_github('hrbrmstr/albersusa') # install albersusa if needed
if (!require(r2d3maps)) devtools::install_github('dreamRs/r2d3maps') # install r2d3maps if needed


# check if packages are not installed; if yes, install missing packages
pacman::p_load(tidyverse, magrittr, janitor, dataPreparation, lubridate, skimr, # for data analysis
               COVID19, rvest, readxl, # for extracting relevant data
               DT, pander, stargazer, knitr, # for formatting and nicely printed outputs
               scales, RColorBrewer, DataExplorer, tiff, grid,# for plots
               plotly, albersusa, tigris, leaflet, tmap, # for maps
               zoo, fpp2, NbClust, # for TS analysis and clustering
               VIM, rgdal,  spdep,  nimble, fastDummies, matrixStats, # for spatial regression 
               nnet, caret, # for multinomial regression modeling
               conflicted) # for managing conflicts in functions with same names

# Handling conflicting function names from packages
conflict_prefer('combine', 'dplyr') # Preferring dplyr::combine over any other package
conflict_prefer('select', "dplyr") #Preferring dplyr::select over any other package
conflict_prefer("summarize", "dplyr") # similar to above but with dplyr::summarize
conflict_prefer("filter", "dplyr") # Preferring filter from dplyr
conflict_prefer("dist", "stats") # Preferring dist from stats
conflict_prefer("as.dist", "stats") # Preferring as.dist from stats

# Custom Functions
devtools::source_url('https://raw.githubusercontent.com/fmegahed/covid19-deaths/master/Markdown/custom_functions.R') # namespaced so it works even if devtools was only just installed

set.seed(2020) # to assist with reproducibility
sInfo = sessionInfo() # saving all the packages/functions and session info
```

# Extracting the Datasets

For our analysis, we fuse data from multiple sources. We describe the process of obtaining and merging each of these sources in the subsections below.


## Time Series Data
In this section, we utilize the [COVID19 package](https://cran.r-project.org/web/packages/COVID19/COVID19.pdf) [@Guidotti2020] to obtain the following information:    

  - **Confirmed cases, recoveries, and deaths**;    
  - **Policy information** (e.g., transport closing, school closing, event cancellations, movement restrictions, testing policies, and contact tracing); and
  - **Population and standard geographic information** for each county. 

From this information, we have also computed the new daily and weekly confirmed cases/deaths per county. The data are stored in a tidy (long) format, but can be expanded to a wide format using `pivot_wider()` from the tidyr package, which is part of the [tidyverse](https://www.tidyverse.org/).
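
As a minimal sketch of the tidy-to-wide conversion mentioned above, the example below reshapes a small made-up data frame. The object names (`toy`, `toyWide`) and all values are hypothetical and are not part of our data.

```r
library(tidyr)

# Toy tidy data: one row per county-date pair (all values are made up)
toy = data.frame(
  id = rep(c("countyA", "countyB"), each = 3),
  date = rep(as.Date("2020-03-01") + 0:2, times = 2),
  newCases = c(0, 2, 5, 1, 1, 4)
)

# Wide format: one row per county and one column per date
toyWide = pivot_wider(toy, names_from = date, values_from = newCases)
print(toyWide)
```

The wide layout, with one row per county, is the shape that clustering functions typically expect.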

```{r confirmedCases, results='asis'}
endDate = '2021-01-02'
endDatePrintV = format(ymd(endDate), format = "%b %d, %Y")

counties = covid19(country = "US", 
                   level = 3, # for county
                   start = "2020-03-01", # First Sunday in March
                   end = endDate, # end Date 
                   raw = FALSE, # to ensure that all counties have the same grid of dates
                   amr = NULL, # we are not using the apple mobility data for our analysis
                   gmr = NULL, # we are not using the Google mobility data for our analysis
                   wb = NULL, # world bank data not helpful for county level analysis
                   verbose = FALSE)

counties %<>% # next line removes non-contiguous US states/territories
  filter(!administrative_area_level_2 %in% c('Alaska', 'Hawaii', 'Puerto Rico', 'Northern Mariana Islands', 'Virgin Islands')) %>% 
  fast_filter_variables(verbose = FALSE) %>% #dropping invariant columns or bijections
  filter(!is.na(key_numeric)) %>%  # these are not counties
  group_by(id) %>% # grouping the data by the id column to make computations correct
  arrange(id, date) %>% # to ensure correct calculations
  mutate(day = wday(date, label = TRUE) %>% factor(ordered = F), # day of week
         newCases = c(NA, diff(confirmed)), # computing new daily cases per county
         newDeaths = c(NA, diff(deaths)) )  # computing new daily deaths per county

# manually identifying factor variables
factorVars = c("school_closing", "workplace_closing", "cancel_events",
               "gatherings_restrictions", "transport_closing", "stay_home_restrictions",
               "internal_movement_restrictions", "international_movement_restrictions",
               "information_campaigns", "testing_policy", "contact_tracing")

counties %<>% # converting those variables into character and then factor
  mutate_at(.vars = vars(any_of(factorVars)), .funs = as.character) %>% 
  mutate_at(.vars = vars(any_of(factorVars)), .funs = as.factor)

# Saving the data into an RDS file
saveRDS(counties, paste0("../Data/counties.rds"))
```
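
The chunk above derives new daily counts from cumulative counts by differencing within each county. The sketch below illustrates the same idea in base R on made-up numbers; `toy` is a hypothetical object, and the rows are assumed to be sorted by date within each county.

```r
# Toy cumulative counts for two counties (all values are made up)
toy = data.frame(
  id = rep(c("countyA", "countyB"), each = 4),
  confirmed = c(0, 3, 3, 10, 5, 6, 9, 9)
)

# Difference consecutive cumulative counts within each county;
# the first day of each county has no predecessor, hence NA
toy$newCases = ave(toy$confirmed, toy$id, FUN = function(x) c(NA, diff(x)))
print(toy)
```

Differencing without respecting the county grouping would subtract the last value of one county from the first value of the next, which is why the grouping step matters.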

## Cross Sectional Data

In the code chunk below, we obtain seven additional datasets, whose variables may help explain the differences between the time-series of COVID-19 cases across counties:  

  A. *Rural/Underserved Counties:* From the [Consumer Financial Protection Bureau](https://www.consumerfinance.gov/policy-compliance/guidance/mortgage-resources/rural-and-underserved-counties-list/), we have obtained the Final 2020 List titled *Rural or underserved counties*. Per the website, the procedure for determining the classification of a county is as follows: "Beginning in 2020, the rural or underserved counties lists use a methodology for identifying underserved counties described in the Bureau’s interpretive rule: Truth in Lending Act (Regulation Z); [Determining “Underserved” Areas Using Home Mortgage Disclosure Act Data](https://www.consumerfinance.gov/policy-compliance/rulemaking/final-rules/truth-lending-regulation-z-underserved-areas-home-mortgage-disclosure-act-data/)."   

  B. Based on the [US Census Data](https://www.census.gov/library/publications/2011/compendia/usa-counties-2011.html), we extracted the land area in square miles for each county, which we combined with population to compute **each county's population density**. Based on the available COVID-19 literature, we hypothesize that population density is predictive of hotspots for COVID-19 transmission.  
  
  C. Based on @DVN/VOQCHQ_2018, we have obtained the voting results for all counties in the 2016 Presidential election. The data were used to compute the percentage of total votes that went to President Trump, with the underlying hypothesis that the politicization of the COVID-19 response (e.g., perception/willingness to use face masks, policies and the population’s reaction to the disease) may be explained by party affiliation.
  
  D. We extracted **an overall government response index capturing the strength of COVID-19 response policies on a state (and the District of Columbia) level** from the [Blavatnik School of Government's GitHub Repository](https://github.com/OxCGRT/USA-covid-policy). This index aggregates 13 different indicators, capturing the "full range of government response". Details on how this indicator is computed can be found at [BSG-WP-2020/034](https://www.bsg.ox.ac.uk/sites/default/files/2020-08/BSG-WP-2020-034.pdf).  
  
  E. Based on the [following Kaiser Health News Webpage](https://khn.org/news/as-coronavirus-spreads-widely-millions-of-older-americans-live-in-counties-with-no-icu-beds/#lookup), we extracted county-level information on the **percent of population aged 60+** and the **number of ICU beds per 10,000 seniors**.  
  
  F. We have engineered a `region` variable based on the [CDC's 10 Regions Framework](https://www.cdc.gov/coordinatedchronic/docs/nccdphp-regions-map.pdf). While geographic regions are hypothesized to be a factor in disease outbreaks, we chose to utilize the CDC regions specifically based on the following explanation from the aforementioned link:  
  > "CDC’s National Center for Chronic Disease Prevention and Health Promotion (NCCDPHP) is strengthening the consistency and quality of the guidance, communications, and technical assistance provided to states to improve coordination across our state programs"

  G. Based on the [Census's Small Area Income and Poverty Estimates (SAIPE) Program](https://www.census.gov/programs-surveys/saipe.html), we extracted the estimate for the **percent of population in poverty**. The estimate is based on 2018 data (released in December 2019). At the time of the start of our analysis, these estimates were the most up to date publicly available data.


```{r possibleXvars}
crossSectionalData = counties %>% ungroup() %>% 
  select(id, key_numeric, key_google_mobility, population,
         administrative_area_level_2, administrative_area_level_3) %>%
  unique()

# [A] Rural or Urban Classification of the County
ru = read.csv("https://www.consumerfinance.gov/documents/8911/cfpb_rural-underserved-list_2020.csv")
ru %<>%  transmute(key_numeric = FIPS.Code, #renaming FIPS.Code to key_numeric 
                countyType = "Rural/Underserved") # creates two vars and drop old vars
crossSectionalData = merge(crossSectionalData, ru, by = "key_numeric", all.x = TRUE) # to define NA counties
crossSectionalData$countyType %<>% replace_na("Other") # for any county not in the Consumer FIN data replace NA by Other


# [B] Population Density of Each County
download.file("https://www2.census.gov/library/publications/2011/compendia/usa-counties/excel/LND01.xls",
              destfile = "../Data/LND01.xls", mode = "wb") # downloading Land Area Data Per the 2010 Census
areas = read_excel("../Data/LND01.xls") %>% # reading the Excel file
  select(STCOU, LND110210D) #selecting only the FIPS and the Land Area from the 2010 Census variables
colnames(areas) = c("key_numeric", "LandAreaSqMiles2010") # Renaming the columns
areas$key_numeric %<>% as.numeric() # to remove leading 0 

crossSectionalData = merge(crossSectionalData, areas, by ="key_numeric", all.x = TRUE) # adding the area to data frame
crossSectionalData$popDensity = crossSectionalData$population / crossSectionalData$LandAreaSqMiles2010 # creating the population density variable
crossSectionalData %<>% select(-c(population, LandAreaSqMiles2010)) #dropping two variables used in creating pop density 


# [C] 2016 Presidential Elections County Data from Harvard https://doi.org/10.7910/DVN/VOQCHQ
elections = read.csv("../Data/countypres_2000-2016.csv") %>% # reading the downloaded CSV
  filter(year == 2016 & party == "republican") %>% # just keeping data for recent election and republican votes
  mutate(key_numeric = FIPS, # renaming FIPS to key_numeric
         percRepVotes = 100*(candidatevotes/totalvotes) ) %>% # computing percent of republican votes (from total votes)
  select(key_numeric, percRepVotes) # keeping only the key and variable used in merge
crossSectionalData %<>%  merge(elections, by = "key_numeric", all.x = TRUE) # merge with the counties data


# [D] Policy Data
policy = read_csv('https://raw.githubusercontent.com/OxCGRT/USA-covid-policy/master/data/OxCGRT_US_latest.csv')
policy = filter(policy, !is.na(RegionName) & !RegionName %in% c('Alaska', 'Hawaii')) # dropping rows without a state and the non-contiguous states
policy$state = toupper(policy$RegionName) # a state variable = an upper case of existing RegionName
policy$Date %<>% ymd() # converting the Date data to a date format

policySummary = policy %>% # calculating a summary table of median value for the GovernmentResponseIndex per state
  filter(Date >= '2020-03-01' & Date <= endDate) %>% # to match our COVID Data timeSeries
  group_by(state) %>% # perform computations using the median value, per state, for each index
  summarise(GovernmentResponseIndexMedian = median(GovernmentResponseIndex, na.rm = TRUE))
policySummary$state %<>% str_to_title() %>% str_replace('Washington Dc', 'District of Columbia') # title case first, so that "of" stays lowercase for the merge

crossSectionalData %<>%  merge(policySummary, by.x = "administrative_area_level_2", by.y = 'state', all.x = TRUE) 


# [E] Kaiser Health News Data on the County Level
hospitals = read.csv("../Data/data-FPBfZ.csv") %>% # downloaded from KHN on 2020-10-26 (~9:30 pm EDT)
  transmute(State = State, # keeping the State Variable | transmute drops variables that are not in call
            County = County, # keeping the County Variable
            PercentSeniors = Percent.of.Population.Aged.60., # Shortening Original Variable Name
            icuBedsPer10000Seniors = 10000 * ICU.Beds/Population.Aged.60.) # Computing icuBedsPer10000Seniors

crossSectionalData %<>% merge(hospitals, 
                              by.x = c("administrative_area_level_2", "administrative_area_level_3"),
                              by.y = c("State", "County"), all.x = TRUE)


# [F] CDC Regions for Each State
regionsCDC = data.frame(States = c('Connecticut', 'Maine', 'Massachusetts', 'New Hampshire', 'Rhode Island' , 
                                   'Vermont', 'New York', # End of Region A
                                   'Delaware', 'District of Columbia', 'Maryland', 'Pennsylvania',
                                   'Virginia', 'West Virginia', 'New Jersey', # End of Region B
                                   'North Carolina', 'South Carolina', 'Georgia', 'Florida', # Region C
                                   'Kentucky', 'Tennessee', 'Alabama', 'Mississippi', # Region D
                                   'Illinois', 'Indiana', 'Michigan', 'Minnesota', 'Ohio',
                                   'Wisconsin', # End of Region E
                                   'Arkansas', 'Louisiana', 'New Mexico', 'Oklahoma', 'Texas', # Region F
                                   'Iowa', 'Kansas', 'Missouri', 'Nebraska', # Region G
                                   'Colorado', 'Montana', 'North Dakota', 'South Dakota',
                                   'Utah', 'Wyoming', # End of Region H
                                   'Arizona', 'California', 'Hawaii', 'Nevada', # Region I
                                   'Alaska', 'Idaho', 'Oregon', 'Washington' # Region J
                                   ),
                        regions = c(rep('A', 7), rep('B', 7), rep('C', 4),
                                    rep('D', 4), rep('E', 6), rep('F', 5),
                                    rep('G', 4), rep('H', 6), rep('I', 4),
                                    rep('J', 4) ) )

crossSectionalData %<>% merge(regionsCDC, by.x = 'administrative_area_level_2', by.y = 'States', all.x = TRUE) # merge


# [G] Poverty Estimates
download.file("https://www2.census.gov/programs-surveys/saipe/datasets/2018/2018-state-and-county/est18all.xls", 
              destfile = "../Data/est18all.xls", mode = "wb") # downloading the data for poverty estimates (latest 2018)

poverty = read_excel("../Data/est18all.xls", skip = 3) %>% # reading the data in R
  transmute(key_numeric = paste0(`State FIPS Code`, `County FIPS Code`) %>% as.numeric, # creating the key from two variables
            povertyPercent = as.numeric(`Poverty Percent, All Ages`) ) # shortening povertyPercent Variable's Name
crossSectionalData %<>% merge(poverty, by = "key_numeric", all.x = TRUE) # merge


# Final Transformations before Saving the Counties Data
crossSectionalData %<>%  mutate_at(.vars = c('countyType', 'regions'), as.factor)  # converting the two vars to factor

# Saving the data into an RDS file
saveRDS(crossSectionalData, paste0("../Data/crossSectionalData.rds"))

# Tabulating the results and providing a way to export the table to different formats
datatable(crossSectionalData %>% select(-c(id, key_numeric, administrative_area_level_2, administrative_area_level_3)),
          extensions = c('FixedColumns', 'Buttons'), options = list(
            dom = 'Bfrtip',
            scrollX = TRUE,
            buttons = c('copy', 'csv', 'excel', 'pdf'),
            fixedColumns = list(leftColumns = 1)),
          rownames = FALSE) %>% 
  formatRound(columns= c('popDensity', 'percRepVotes', 'GovernmentResponseIndexMedian',
                         'PercentSeniors', 'icuBedsPer10000Seniors', 'povertyPercent'),
              digits=1)
```
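
Most of the merges in the chunk above follow the same pattern: a left join on the county key, followed by filling in the counties that are missing from the lookup table. The sketch below illustrates that pattern, together with the population density calculation, on made-up values; every object name and number in it is hypothetical.

```r
# Toy inputs (all values are made up for illustration)
countiesToy = data.frame(key_numeric = c(1001, 1003, 1005),
                         population = c(55000, 220000, 25000))
ruralToy = data.frame(key_numeric = 1005,
                      countyType = "Rural/Underserved",
                      stringsAsFactors = FALSE)
areaToy = data.frame(key_numeric = c(1001, 1003, 1005),
                     landAreaSqMiles = c(600, 1600, 900))

# Left join: all.x = TRUE keeps every county, with NA where no match exists
merged = merge(countiesToy, ruralToy, by = "key_numeric", all.x = TRUE)
merged$countyType[is.na(merged$countyType)] = "Other" # counties not on the list

# Second left join, then population density as people per square mile
merged = merge(merged, areaToy, by = "key_numeric", all.x = TRUE)
merged$popDensity = merged$population / merged$landAreaSqMiles
print(merged)
```

Using `all.x = TRUE` throughout ensures that a county is never silently dropped because one of the auxiliary sources lacks a record for it.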


## Exploratory Analysis

In this section, we perform an exploratory analysis on the data obtained from the multiple sources.

### Cumulative Cases

```{r cumulativeCasesFig}
noGoogleNAs = filter(crossSectionalData, !is.na(key_google_mobility)) # removing NAs from key_google_mobility
idIndex = sample(noGoogleNAs$id, 9) # sampling 9 counties by id

# Saving the cumulative cases figure to a TIFF file
tiff(filename = '../Figures/sampleCumulativeCases.tiff',
    width = 1366, height =768, pointsize = 16)
counties %>% filter(id %in% idIndex) %>% 
  ggplot(aes(x = date, y = confirmed, group = id, color = key_google_mobility)) +
  geom_line(size = 1.25) +
  scale_x_date(date_breaks = "1 month", date_labels = "%b") +
  facet_wrap(~ key_google_mobility, scales = 'free_y', ncol = 3) +
  theme(legend.position = 'none') + 
  labs(color = '', x = 'Month', y = 'Cumulative Cases By County',
       caption = paste0('Based on Data from March 01, 2020 - ', endDatePrintV)) +
  scale_color_brewer(type = 'qual', palette = 'Paired')
invisible( dev.off() ) # to suppress the unwanted output from dev.off

# Creating an interactive plot for the markdown
p = ggplot2::last_plot() + geom_line(size = 0.75) + # modifying the plot for plotly
  theme_bw(base_size = 9) + theme(legend.position = 'none') # to make margins smaller
ggplotly(p, height = 768) %>%  layout_ggplotly()

```


### New Daily Cases

```{r newCasesFig}
# Saving the new daily cases figure to a TIFF file
tiff(filename = '../Figures/sampleNewDailyCases.tiff',
    width = 1366, height =768, pointsize = 16)
counties %>% filter(id %in% idIndex) %>% 
  ggplot(aes(x = date, y = newCases, group = id, color = key_google_mobility)) +
  geom_line(size = 1.25) +
  scale_x_date(date_breaks = "1 month", date_labels = "%b") +
  facet_wrap(~ key_google_mobility, scales = 'free_y', ncol = 3) +
  theme(legend.position = 'none') + 
  labs(color = '', x = 'Month', y = 'New Daily Cases By County',
       caption = paste0('Based on Data from March 01, 2020 - ', endDatePrintV)) +
  scale_color_brewer(type = 'qual', palette = 'Paired')
invisible( dev.off() ) # to suppress the unwanted output from dev.off

# Creating an interactive plot for the markdown
p = ggplot2::last_plot() + geom_line(size = 0.75) + # modifying the plot for plotly
  theme_bw(base_size = 9) + theme(legend.position = 'none') # to make margins smaller
ggplotly(p, height = 768) %>%  layout_ggplotly()
```


### County Types

```{r countyTypes}
crossSectionalData$fips = str_pad(crossSectionalData$key_numeric,
                                     width = 5, side = 'left', pad = '0')
# Retrieving the U.S. county composite map as a simple features object
cty_sf = counties_sf("aeqd") %>% filter(!state %in% c('Alaska', 'Hawaii')) # from albersusa
cty_sf %<>% geo_join(crossSectionalData, by_sp= 'fips', by_df= 'fips')

# Saving a higher quality tiff file for use in the paper
tiff(filename = '../Figures/countyTypes.tiff', width = 1366, height =768, pointsize = 16)
tm_shape(cty_sf) + tm_polygons('countyType', title = 'County Type', palette = "Paired")
invisible( dev.off() ) # to suppress the unwanted output from dev.off

# Printing the png output in the Markdown doc
tm_shape(cty_sf) + tm_polygons('countyType', title = 'County Type', palette = "Paired")
```


### Population Density

```{r popDensityFig}
# Saving a higher quality tiff file for use in the paper
tiff(filename = '../Figures/popDensity.tiff', width = 1366, height =768, pointsize = 16)
tm_shape(cty_sf) + tm_polygons('popDensity', title = 'Population Density', palette = "Greens",
                               style = 'quantile')
invisible( dev.off() ) # to suppress the unwanted output from dev.off

# Printing the png output in the Markdown doc
tm_shape(cty_sf) + tm_polygons('popDensity', title = 'Population Density', palette = "Greens",
                               style = 'quantile')
```

### Percent Republican Votes

```{r percRepVotesFig}
# Saving a higher quality tiff file for use in the paper
tiff(filename = '../Figures/repVotes.tiff', width = 1366, height =768, pointsize = 16)
tm_shape(cty_sf) + tm_polygons('percRepVotes', title = '% Republican Votes', palette = "Reds")
invisible( dev.off() ) # to suppress the unwanted output from dev.off

# Printing the png output in the Markdown doc
tm_shape(cty_sf) + tm_polygons('percRepVotes', title = '% Republican Votes', palette = "Reds")
```


### Government Response Data
```{r govRespFig}
state_sf = usa_sf("aeqd") %>% filter(!name %in% c('Alaska', 'Hawaii')) # from albersusa
state_sf %<>% geo_join(crossSectionalData, by_sp= 'name', by_df= 'administrative_area_level_2')

# Saving a higher quality tiff file for use in the paper
tiff(filename = '../Figures/govResponse.tiff', width = 1366, height =768, pointsize = 16)
tm_shape(state_sf) + tm_polygons('GovernmentResponseIndexMedian', 
                                 title = 'Median Value of the Government Response Index', palette = "-Greens")
invisible( dev.off() ) # to suppress the unwanted output from dev.off

# Printing the png output in the Markdown doc
tm_shape(state_sf) + tm_polygons('GovernmentResponseIndexMedian', 
                                 title = 'Median Value of the Government Response Index', palette = "-Greens")
```


### Percent Seniors

```{r percSeniorsFig}
# Saving a higher quality tiff file for use in the paper
tiff(filename = '../Figures/percSeniors.tiff', width = 1366, height =768, pointsize = 16)
tm_shape(cty_sf) + tm_polygons('PercentSeniors', title = '% Seniors', palette = "Greens")
invisible( dev.off() ) # to suppress the unwanted output from dev.off

# Printing the png output in the Markdown doc
tm_shape(cty_sf) + tm_polygons('PercentSeniors', title = '% Seniors', palette = "Greens")
```

### CDC Regions

```{r cdcRegsFig}
# Saving a higher quality tiff file for use in the paper
tiff(filename = '../Figures/cdcRegions.tiff', width = 1366, height =768, pointsize = 16)
tm_shape(state_sf) + tm_polygons('regions', title = 'CDC Region', palette = "Paired")
invisible( dev.off() ) # to suppress the unwanted output from dev.off

# Printing the png output in the Markdown doc
tm_shape(state_sf) + tm_polygons('regions', title = 'CDC Region', palette = "Paired")
```

### Percent Poverty

```{r povertyPercentFig}
# Saving a higher quality tiff file for use in the paper
tiff(filename = '../Figures/povertyPercent.tiff', width = 1366, height =768, pointsize = 16)
tm_shape(cty_sf) + tm_polygons('povertyPercent', title = 'Poverty Percent', palette = "Greens")
invisible( dev.off() ) # to suppress the unwanted output from dev.off

# Printing the png output in the Markdown doc
tm_shape(cty_sf) + tm_polygons('povertyPercent', title = 'Poverty Percent', palette = "Greens")
```


# Time-Series Clustering 

In our estimation, three decisions are important when performing time-series clustering:  

  - *Preparation of the Different Time-Series to be Clustered:* In this section, we have (a) selected the new daily cases per county as the primary variable of interest, (b) smoothed that variable using a seven-day moving average, and (c) scaled the observations within each county's seven-day moving average of new daily cases such that they are bounded between 0 and 1. This allows us to compare the **shape** of the time-series/profile across counties with different populations, where the magnitude of the cases can be quite different.
  
  - *Choice of Distance Measure:* The Euclidean distance, i.e., the $l_2$ norm, is the most commonly used distance measure since it is computationally efficient. However, it is sensitive to noise, scale, and time shifts, and it may not be suitable for applications where the time-series are of different lengths [@sarda2017comparing].  
  
  - *Choice of Clustering Algorithm:* A large number of clustering algorithms have been proposed in the literature. The most common clustering approaches are shape-based, which include $k$-means clustering and hierarchical clustering. The reader is referred to @aghabozorgi2015time for a detailed review. In our preliminary analysis, we chose the hierarchical clustering approach since it provides an easy-to-understand dendrogram and the number of counties was small. In our full analysis, however, we use the $k$-means clustering algorithm since it is computationally efficient. Furthermore, we overcame the traditional limitation of having to pre-specify $k$ by utilizing 26 indices for determining the optimal number of clusters in a data set, based on the excellent approach and package implementation of @charrad2014NbClust.
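
The three decisions above can be illustrated on synthetic data with base R only. The sketch below is not our analysis pipeline: the helper names (`ma7`, `scale01`), the six bell-shaped toy series, and the fixed choice of $k = 2$ are all assumptions made for illustration, whereas our analysis selects $k$ with NbClust.

```r
set.seed(2020)

# Trailing (right-aligned) 7-day moving average; the first six values are NA
ma7 = function(x) as.numeric(stats::filter(x, rep(1 / 7, 7), sides = 1))

# Scale a series by its own maximum; NAs and negative values are set to 0
scale01 = function(x) {
  out = x / max(x, na.rm = TRUE)
  out[is.na(out) | out < 0] = 0
  out
}

# Six toy "counties": three peak early, three peak late, at very different magnitudes
days = 1:60
bump = function(center, size) size * dnorm(days, mean = center, sd = 5)
toySeries = rbind(early1 = bump(14, 10), early2 = bump(15, 1000), early3 = bump(16, 50),
                  late1 = bump(44, 5), late2 = bump(45, 500), late3 = bump(46, 80))

# Decision 1: smooth and scale each series so that only its shape remains
prepped = t(apply(toySeries, 1, function(x) scale01(ma7(x))))

# Decision 2: Euclidean distance between the scaled profiles
d = dist(prepped, method = "euclidean")

# Decision 3: k-means on the scaled profiles (k fixed at 2 for this sketch)
km = kmeans(prepped, centers = 2, nstart = 10)
print(km$cluster)
```

Because each series is divided by its own maximum, the early and late groups are separated by the timing of their peaks rather than by the size of the toy counties.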


## Data Preparation

```{r dataPrepClustering}
clusteringPrep = counties %>% # from the counties
  select(id, date, key_google_mobility, newCases) %>% # selecting minimal amount of cols for visual inspection
  arrange(id, date) %>% # arranged to ensure correct calculations
  mutate(newMA7 = rollmeanr(newCases, k = 7, fill = NA), # 7-day ma of new (adjusted) cases
         maxMA7 = max(newMA7, na.rm = T), # obtaining the max per county to scale data
         scaledNewMA7 = pmax(0, newMA7/maxMA7, na.rm = TRUE) ) %>% # scaling data to a 0-1 scale by county
  select(id, key_google_mobility, date, scaledNewMA7) %>% # dropping the variable newCases
  pivot_wider(names_from = date, values_from = scaledNewMA7) # converting the data to a wide format for clustering

constantColumns  = which_are_constant(clusteringPrep, verbose = F) # identifying constant columns
datesDropped = colnames(clusteringPrep)[constantColumns] # used for printing the names after the code chunk

clusteringPrep %<>% select(-all_of(constantColumns) ) %>%  # speeds up clustering by dec length of series
  as.data.frame() # data needs to be data frame for clustering
row.names(clusteringPrep) = clusteringPrep[,1] # needed for tsclust
clusteringPrep = clusteringPrep[,-1] # dropping the id column since it is now row.name
```

The following dates were removed from our data frame since the `scaledNewMA7` variable was constant across all counties: `r pander(datesDropped, compact = TRUE)`.
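
For readers unfamiliar with this step, the sketch below shows one way to detect and drop constant columns in base R. It is a stand-in for illustration only and is not how the dataPreparation package implements `which_are_constant()`; the object names and values are made up.

```r
# Toy wide data: one row per county, one column per date (values are made up)
toyWide = data.frame(
  `2020-03-01` = c(0, 0, 0),
  `2020-03-02` = c(0, 0.2, 0),
  `2020-03-03` = c(0.5, 1, 0.1),
  check.names = FALSE # keep the dates as column names
)

# A column is constant when it holds a single distinct value
isConstant = vapply(toyWide, function(col) length(unique(col)) == 1, logical(1))
droppedDates = names(toyWide)[isConstant]
toyReduced = toyWide[, !isConstant, drop = FALSE]
print(droppedDates)
```

Constant columns carry no information for distinguishing between counties, so removing them shortens the series without affecting the Euclidean distances.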


## Clustering Contiguous U.S. Counties

```{r tsClustering, fig.show='hide'}
clusteringPrep %<>% select(-c(key_google_mobility))  # removing this variable so we can cluster

nc  = NbClust(clusteringPrep, distance = "euclidean", # euclidean distance
             min.nc = 2, max.nc = 49, # searching for optimal k between k=2 and k=49
             method = "kmeans", # using the k-means method
             index = "all") # using 26 of the 30 indices in the package

kclus  = nc$Best.partition %>% as.data.frame() %>% #obtaining the best partition/ cluster assignment for optimal k
  rename(., cluster_group = .) %>% rownames_to_column("County") 

#converting the wide to tall data and adding the cluster groupings
clusters  = clusteringPrep %>% 
  rownames_to_column(var = "County") %>% 
  pivot_longer(cols = -County, names_to = "Date") %>% # all date columns, including the January 2021 dates
  inner_join(., kclus, by = "County") %>% 
  mutate(cluster_group = as.factor(cluster_group))

idClusters  = clusters %>% select(c(County, cluster_group)) # creating a look-up table of county and cluster group
colnames(idClusters)  = c('id', 'cluster_group') # renaming the columns
idClusters %<>%  unique() #removing the duplicates due to different dates (we had that to ensure that the clustering was applied correctly)

# Adding Cluster Grouping to a subset of the counties data frame
clusterCounties = counties %>% 
  select(c(id, key_numeric, key_google_mobility, administrative_area_level_2, administrative_area_level_3)) %>% 
  inner_join(., idClusters, by ='id') %>% 
  mutate(cluster_group = paste0('C', cluster_group)) %>% 
  unique()

# saving the results as a RDS File
saveRDS(clusterCounties, '../Data/clusterCounties.rds')
```


## Visualizing the Clustering Results

In this subsection, we provide three plots:  

  - A paneled spaghetti plot, highlighting the median of the scaled time-series profiles for each cluster;  
  - A panel plot where the first, second, and third quartiles of the scaled time-series for each cluster are compared; and  
  - An interactive choropleth map to visualize the spatial distribution of the clusters, where the reader can click on a given county to show: (a) county name, (b) assigned cluster, (c) population density, and (d) percentage of residents in poverty.


### Spaghetti Plot

```{r spaghetti}
spaghettiDF = counties %>% # from the counties
  select(id, date, newCases, key_google_mobility) %>% # selecting minimal columns
  left_join(clusterCounties[, c('id', 'cluster_group')], by = 'id') %>% # to get clusters
  arrange(id, date) %>% # arranged to ensure correct calculations
  mutate(newMA7 = rollmeanr(newCases, k = 7, fill = NA), # 7-day ma of new (adjusted) cases
         maxMA7 = max(newMA7, na.rm = T), # obtaining the max per county to scale data
         scaledNewMA7 = pmax(0, newMA7/maxMA7, na.rm = TRUE) ) %>% 
  ungroup() %>% select(date, cluster_group, scaledNewMA7, key_google_mobility) %>% 
  group_by(date, cluster_group)

spaghettiDF$cluster_group %<>% as.factor() 

# Creating a Named Color Scale
colorPal =  brewer.pal(n= levels(spaghettiDF$cluster_group) %>% length(), 'Set2')
names(colorPal) = levels(spaghettiDF$cluster_group)

# Saving the spaghetti plot to a TIFF file
tiff(filename = '../Figures/spaghettiPlot.tiff', width = 1366, height =768, pointsize = 16)
spaghettiDF %>%  
  ggplot(aes(x = date, y = scaledNewMA7, color = cluster_group, group = key_google_mobility)) +
  scale_x_date(date_breaks = "1 month", date_labels = "%b") +
  geom_line(size = 0.25, alpha = 0.1) +
  stat_summary(aes(group = 1), 
               fun= median,
               geom = "line",
               size = 1.25, col = 'black') + 
  facet_wrap(~ cluster_group, ncol = 1) +
  theme(legend.position = 'none') + 
  labs(x = 'Month', y = 'Scaled New Cases By Cluster By Day',
       caption = paste0('Solid black line represents the median for each cluster |\n',
                        'Based on Data from March 01, 2020 - ', endDatePrintV) )  +
  scale_color_manual(values = colorPal)
invisible( dev.off() ) # to suppress the unwanted output from dev.off

# Printing a png version of the plot in Markdown (lower quality image for quicker compilation of HTML)
readTIFF("../Figures/spaghettiPlot.tiff") %>% grid.raster()
```


### Summary Plot
```{r summaryPlot}
# Creating a data frame containing statistical summaries of the time series by cluster_group
summaryDf = spaghettiDF %>% 
  summarise(Median = median(scaledNewMA7, na.rm= TRUE),
            `First Quartile` = quantile(scaledNewMA7, probs = 0.25, na.rm= TRUE),
            `Third Quartile` = quantile(scaledNewMA7, probs = 0.75, na.rm= TRUE)) %>% 
  pivot_longer(cols = c(`First Quartile`, Median, `Third Quartile`),
               names_to = 'Statistic')

tiff(filename = '../Figures/summaryPlot.tiff', width = 1366, height =768, pointsize = 16)
summaryDf %>% 
  ggplot(aes(x = date, y = value, color = cluster_group, linetype =  Statistic)) +
  scale_x_date(date_breaks = "1 month", date_labels = "%b") +
  geom_line(size = 1.25) +
  scale_linetype_manual(values = c('dotted', 'solid', 'twodash')) + # matched alphabetically: First Quartile, Median, Third Quartile
  facet_wrap(~ cluster_group, ncol = 1) +
  theme(legend.position = 'top') + 
  labs(color = '', x = 'Month', y = 'Quartiles of Scaled New Cases By Cluster By Day',
       caption = paste0('Based on Data from March 01, 2020 - ', endDatePrintV)) +
  scale_color_manual(values = colorPal)
invisible( dev.off() ) # to suppress the unwanted output from dev.off

# Rendering the saved tiff as a raster image in Markdown (lower quality image for quicker compilation of HTML)
readTIFF("../Figures/summaryPlot.tiff") %>% grid.raster()
```
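
As a small illustration of the three statistics plotted above, the sketch below computes the quartiles per day for one hypothetical cluster with four counties. The data frame `toy` and its values are assumptions made for this example, and base R is used in place of the `summarise()` pipeline.

```r
# Toy data: four hypothetical counties of one cluster observed on two days
toy <- data.frame(
  date = rep(as.Date("2020-03-01") + 0:1, each = 4),
  scaledNewMA7 = c(0, 0.2, 0.4, 0.6, 0.1, 0.1, 0.5, 0.9)
)

# Quartiles per day; quantile() uses its default interpolation (type 7)
summaryToy <- t(sapply(split(toy$scaledNewMA7, toy$date),
                       quantile, probs = c(0.25, 0.5, 0.75), na.rm = TRUE))
colnames(summaryToy) <- c("First Quartile", "Median", "Third Quartile")
print(summaryToy)
```

Each row of `summaryToy` corresponds to one day, which is what the three line types trace over time within each facet.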

### Cluster Map

```{r clusterMap, out.width='100%'}
# Joining the clusterCounties results with the existing county simple features object (cty_sf)
clusterCounties$fips = str_pad(clusterCounties$key_numeric, width = 5, side = 'left', pad = '0')
clusterCounties %<>% ungroup()
cty_sf %<>% left_join(clusterCounties[, c('fips', 'cluster_group')], by = 'fips') # adding cluster_group to cty_sf

# Creating a static visual for the paper
tiff(filename = '../Figures/clusterMap.tiff', width = 1366, height =768, pointsize = 16)
tm_shape(cty_sf) + tm_polygons('cluster_group', title = 'Cluster #', palette = colorPal) +
  tm_credits(paste0('Based on Data from March 01, 2020 - ', endDatePrintV), position=c("right", "bottom"))
invisible( dev.off() ) # to suppress the unwanted output from dev.off


# Creating an interactive visual Using the Leaflet Package
#### Creating a longlat projection (required by leaflet)
leaflet_sf = counties_sf("longlat") %>% filter(!state %in% c('Alaska', 'Hawaii')) # from albersusa
leaflet_sf %<>% geo_join(crossSectionalData, by_sp= 'fips', by_df= 'fips') %>% 
  left_join(clusterCounties[, c('fips', 'cluster_group')], by = 'fips')

#### Setting the Color Scheme
leafletPal =  colorFactor('Set2', domain = leaflet_sf$cluster_group, na.color = "white")

#### The visual
leaflet(height=500) %>% # initializing the leaflet map
  setView(lng = -96, lat = 37.8, zoom = 3.8) %>% # setting the view on Continental US
  addTiles() %>% # adding the default tiles
  addPolygons(data = leaflet_sf, stroke = FALSE, fillColor = ~leafletPal(leaflet_sf$cluster_group), # adding the data
              weight = 2, opacity = 1, color = "white", dashArray = "3", fillOpacity = 0.7, # adding color specs
              popup = paste("County:", leaflet_sf$name, '<br>', 
                            "Cluster #:", leaflet_sf$cluster_group, '<br>',
                            "Population Density:", round(leaflet_sf$popDensity, 1), '<br>')) %>% #pop-up Menu
  addLegend(position = "bottomleft", pal = leafletPal, values =  leaflet_sf$cluster_group, 
            title = "Cluster #", opacity = 1) # legend formatting
```
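
The join above relies on five-character FIPS codes, which is why `key_numeric` is left-padded with zeros first. A minimal sketch of that step is given below. It uses base R (`sprintf()` and `merge()`) in place of `str_pad()` and `left_join()`, and the two toy data frames are assumptions made for illustration.

```r
# Hypothetical cluster assignments keyed by a numeric FIPS code
clusterToy <- data.frame(key_numeric = c(1001L, 39017L),
                         cluster_group = c("C1", "C3"),
                         stringsAsFactors = FALSE)

# Left-pad to five characters so that 1001 becomes "01001"
clusterToy$fips <- sprintf("%05d", clusterToy$key_numeric)

# Hypothetical map attribute table keyed by character FIPS codes
mapToy <- data.frame(fips = c("01001", "39017", "48201"),
                     name = c("Autauga", "Butler", "Harris"),
                     stringsAsFactors = FALSE)

# Left join: counties without a cluster assignment keep an NA
joined <- merge(mapToy, clusterToy[, c("fips", "cluster_group")],
                by = "fips", all.x = TRUE)
print(joined)
```

Without the padding, the first county would not match and would be drawn with the `na.color` of the palette.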


# Explanatory Modeling of Cluster Assignments

In the previous section, we showed that, using only a scaled and smoothed time series of daily cases per county, the counties can be grouped into `r length(unique(na.omit(leaflet_sf$cluster_group)))` categories whose time series have distinct shapes (based on the Euclidean distance measure). In this section, we attempt to model the factors that are associated with the cluster assignment.
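
To make the distance measure concrete, the sketch below computes pairwise Euclidean distances between three made-up scaled series. The county names and values are hypothetical, and the clustering algorithm itself is not reproduced here.

```r
# Three hypothetical scaled series observed at three time points
series <- rbind(
  countyA = c(0, 0.5, 1.0),  # rising
  countyB = c(1.0, 0.5, 0),  # falling
  countyC = c(0, 0.4, 1.0)   # rising, close to countyA
)

# Pairwise Euclidean distances between the rows
d <- as.matrix(dist(series, method = "euclidean"))
print(round(d, 3))
```

Series with a similar shape (`countyA` and `countyC`) are close to each other, whereas the falling series is far from both.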


## Descriptive Statistics

```{r skimmed}
multiClassDF = select(clusterCounties, id, cluster_group) %>% 
  left_join(crossSectionalData, by = 'id')  %>% 
  select(-c(administrative_area_level_2, administrative_area_level_3, id, key_numeric))

saveRDS(multiClassDF, '../Data/multiClassDF.rds') # saving the data

skim(multiClassDF) # printing a nice summary table of the data
```


## Boxplot By Cluster
```{r boxplotCluster}
multiClassDF %>% plot_boxplot(by = 'cluster_group', ncol = 2L, 
               ggtheme = theme_bw(),
               geom_boxplot_args = list('outlier.shape' = 1))
```


## Explanatory Modeling Using Multinomial Spatial Regression

### Data Preparation

```{r multiSpatial}
multiClassDF$cluster_group %<>% as.factor() # convert to a factor

# impute without using cluster_group, key_google_mobility and fips
multiClassImputed = VIM::kNN(multiClassDF, imp_var = FALSE,
                             dist_var = colnames(multiClassDF)[3:10])
saveRDS(multiClassImputed, '../Data/multiClassImputed.rds') # saving the data

# Creating a df (which will be used for analysis)
df = multiClassImputed # setting df to equal to the multiclass object
df$clustReLeveled =  relevel(df$cluster_group, ref = maxCat(df$cluster_group) ) # setting the ref level
df  = df %>% select(-c(cluster_group, # removed since it is now redundant with the clustReLeveled variable
                       key_google_mobility, # removed since they are identifier variables
                       icuBedsPer10000Seniors, percRepVotes)) # did not sig. improve predictions
saveRDS(df, '../Data/df.rds') # saving the data
```
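
The effect of `relevel()` on the outcome can be seen on a toy factor. In the sketch below, we assume that the reference category is the most frequent level; this stands in for the `maxCat()` helper, whose definition is not shown in this section.

```r
# Toy cluster labels (made up)
cluster_group <- factor(c("C1", "C2", "C2", "C3", "C2"))

# Assumption: the reference is the most frequent level
mostFrequent <- names(which.max(table(cluster_group)))

# relevel() moves the chosen level to the first position
clustReLeveled <- relevel(cluster_group, ref = mostFrequent)
print(levels(clustReLeveled))
```

Functions such as `multinom()` treat the first level as the baseline, so the reported coefficients are then interpreted relative to that category.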

### Model Building

```{r spatialModelBuild}
multiClassDF.sorted = df[order(df$fips), ]

# File needs to be downloaded from 
# https://www2.census.gov/geo/tiger/TIGER2016/COUNTY/tl_2016_us_county.zip
# and unzipped into the Data folder before running this chunk
shape =  readOGR('../Data', "tl_2016_us_county")
shape_sp = st_as_sf(shape)

# Joining both the clustered data and the shape file in one object
joinedMultiClass = geo_join(shape_sp, multiClassDF.sorted, by_sp = 'GEOID', by_df ='fips',how = 'inner')
joinedMultiClass.sorted = joinedMultiClass[order(joinedMultiClass$GEOID), ]

# Building a neighborhood list from the shape data
nbList = poly2nb(joinedMultiClass.sorted) %>% 
  nb2listw(style = 'W', zero.policy = TRUE) # converting it to a listw object

# Identifying the number of neighbors per county
numCounties = length(nbList$neighbours) # number of counties
numNeighbors = card(nbList$neighbours) # number of neighbors for each county (0 for counties without neighbors)

# Identifying the neighbor counties (row indices into joinedMultiClass.sorted, not fips codes)
fipsOfNeighbourCounties = unlist(nbList$neighbours) # concatenating the neighbor indices
fipsOfNeighbourCounties = fipsOfNeighbourCounties[fipsOfNeighbourCounties > 0] # dropping the 0 placeholder spdep uses for no neighbors

# Pre-nimble parameters
sumNumNeigh = length(fipsOfNeighbourCounties)
m = length(numNeighbors)

# Preparing the outcome
number_of_clusters = unique(multiClassDF.sorted$clustReLeveled) %>% length()
cluster_group_matrix = matrix( 0 , nrow = m, ncol = number_of_clusters)
cluster_group_nu = str_remove(multiClassDF.sorted$clustReLeveled, 'C') %>% as.numeric()

for (i in 1:m)
{
  cluster_group_matrix[ i , cluster_group_nu[i] ] = 1
}

## Storing each predictor in a vector
countyType = multiClassDF.sorted$countyType
Underserved = fastDummies::dummy_cols(countyType)
Underserved.rural = Underserved$`.data_Rural/Underserved` %>% as.matrix() %>% as.vector()
popDensity = multiClassDF.sorted$popDensity %>% as.matrix() %>% as.vector()
GovernmentResponseIndexMedian = multiClassDF.sorted$GovernmentResponseIndexMedian %>% as.matrix() %>% as.vector()
PercentSeniors = multiClassDF.sorted$PercentSeniors %>% as.matrix() %>% as.vector()
povertyPercent = multiClassDF.sorted$povertyPercent %>% as.matrix() %>% as.vector()
regions = multiClassDF.sorted$regions

results = dummy_cols(regions)
A = results$.data_A
B = results$.data_B
C = results$.data_C
D = results$.data_D
E = results$.data_E
FF = results$.data_F
G = results$.data_G
H = results$.data_H
I = results$.data_I



# Nimble Code spatial
code = nimbleCode(
  {
    for (i in 1:m)
    {
      cluster_group_matrix[i,1:number_of_clusters] ~ dmulti( prob = p[i,1:number_of_clusters] ,1) 
      phi[i,1] <- 1
      p[i,1] <- 1/sum(phi[i,1:number_of_clusters])
      for (k in 2:number_of_clusters)
      {
        log(phi[i,k]) <- b0[k] + b1[k]*Underserved.rural[i] + 
          b2[k]*popDensity[i] +
          b3[k]*PercentSeniors[i] + 
          b4[k]*GovernmentResponseIndexMedian[i] + 
          b5[k]*povertyPercent[i] + b6[k]*A[i] + b7[k]*B[i] + 
          b8[k]*C[i] + b9[k]*D[i] + b10[k]*E[i] + b11[k]*F[i] + b12[k]*G[i] + b13[k]*H[i] + b14[k]*I[i] + u[i]
        p[i,k] <- phi[i,k]/sum(phi[i,1:number_of_clusters])
      }
    }    
    for (k in 2:number_of_clusters) 
    {
      b0[k] ~ dnorm(0, 0.00001); b1[k] ~ dnorm(0, 0.00001); b2[k] ~ dnorm(0, 0.00001)
      b3[k] ~ dnorm(0, 0.00001); b4[k] ~ dnorm(0, 0.00001); b5[k] ~ dnorm(0, 0.00001)
      b6[k] ~ dnorm(0, 0.00001); b7[k] ~ dnorm(0, 0.00001); b8[k] ~ dnorm(0, 0.00001)
      b9[k] ~ dnorm(0, 0.00001); b10[k] ~ dnorm(0, 0.00001); b11[k] ~ dnorm(0, 0.00001)
      b12[k] ~ dnorm(0, 0.00001); b13[k] ~ dnorm(0, 0.00001); b14[k] ~ dnorm(0, 0.00001)
    }
    u[1:m] ~ dcar_normal(adj[1:sumNumNeigh], weights[1:sumNumNeigh], 
                         num[1:m],tauu)
    for (j in 1:sumNumNeigh)
    {weights[j] <- 1}
    tauu ~ dgamma(1,0.0001)
  }
)  

constants = list(num=numNeighbors, adj=fipsOfNeighbourCounties,
                  sumNumNeigh = length(fipsOfNeighbourCounties), 
                  m=m,number_of_clusters=number_of_clusters)

data = list(cluster_group_matrix = cluster_group_matrix, 
             Underserved.rural = Underserved.rural,
             popDensity = popDensity,
             PercentSeniors = PercentSeniors,
             GovernmentResponseIndexMedian = GovernmentResponseIndexMedian,
             povertyPercent = povertyPercent,
             A=A, B=B, C=C, D=D, E=E, F=FF,
             G=G, H=H, I=I)

inits = list(b0=rep(0, number_of_clusters), u=rep(0,m),tauu=1, b1=rep(0, number_of_clusters), 
             b2=rep(0, number_of_clusters), b3=rep(0, number_of_clusters), b4=rep(0, number_of_clusters),
             b5=rep(0, number_of_clusters), b6=rep(0, number_of_clusters), b7=rep(0, number_of_clusters), 
             b8=rep(0, number_of_clusters), b9=rep(0, number_of_clusters), b10=rep(0, number_of_clusters),
             b11=rep(0, number_of_clusters), b12=rep(0, number_of_clusters), b13=rep(0, number_of_clusters),
             b14=rep(0, number_of_clusters) )

Rmodel = nimbleModel(code=code, constants=constants, data=data, inits=inits)

compile.Rmodel = compileNimble( Rmodel )

monitors = c('b0','b1','b2','b3','b4','b5','b6','b7','b8',
              'b9','b10','b11','b12','b13','b14','p','tauu')

Rmodel.Conf = configureMCMC( Rmodel , monitors=monitors, thin = 100)

Rmodel.MCMC = buildMCMC( Rmodel.Conf )
compile.Rmodel.MCMC = compileNimble( Rmodel.MCMC )

niter = 300000
nburn = 150000

start.time = proc.time()
spatial.base6 = runMCMC( compile.Rmodel.MCMC, niter = niter, nburnin = nburn,
                         inits = inits, nchains = 1, samplesAsCodaMCMC = TRUE )
stop.time = proc.time()
time.elapsed = stop.time - start.time
print( time.elapsed )
```
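
The `dcar_normal` prior in the model above takes the neighborhood structure as a flat adjacency vector (`adj`) together with the number of neighbors per region (`num`). The sketch below shows that layout on a hypothetical four-region map; the plain list `neighbours` is an assumption that stands in for `nbList$neighbours`.

```r
# Hypothetical map: regions 1 - 2 - 3 form a chain, region 4 has no neighbors
neighbours <- list(2L, c(1L, 3L), 2L, integer(0))

# Number of neighbors of each region
num <- lengths(neighbours)

# Neighbor indices of region 1, then region 2, and so on, in one vector
adj <- unlist(neighbours)

# Unit weights, one per entry of adj (as in the model code)
weights <- rep(1, length(adj))

print(num)
print(adj)
```

The entries of `adj` are positions in the ordering of the regions, so the predictors and the outcome matrix need to be sorted in that same order.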

## Resulting Model
```{r spatialModelCoef, results = 'asis'}
Betacoe = spatial.base6[, (1:60)] # based on current number of predictors
saveRDS(Betacoe, '../Data/betacoe.rds') # saving the data

# Computing the coefficients' values
coeffTable = rbind(colMeans(Betacoe), colSds(Betacoe))
rownames(coeffTable) = c('means', 'stdevs')

# Formatting the output
coeffTable %<>% as.data.frame() %>% 
  select(paste0(paste0("b", rep(seq(0, 14), 4)), '[', 
                c(rep(1,15), rep(2,15), rep(3,15), rep(4,15)),
                ']' )) %>% # reordering cols by name
  select_if(~ !any(is.na(.)))  # dropping NA cols (corresponding to Cluster 1)

tCoeffTable = t(coeffTable)

tCoeffTable = cbind(tCoeffTable[1:15,], tCoeffTable[16:30,], tCoeffTable[31:45,]) %>% data.frame()
row.names(tCoeffTable) = c('constant', 'rural', 'popDensity', 'percSeniors',
                           'govResponse', 'percPoverty', 'regionA', 'regionB', 'regionC',
                           'regionD', 'regionE', 'regionF', 'regionG', 'regionH',
                           'regionI')
colnames(tCoeffTable) = c('C2_coef_mean', 'C2_coef_sd',
                          'C3_coef_mean', 'C3_coef_sd',
                          'C4_coef_mean', 'C4_coef_sd')

tCoeffTable %>% round(digits = 3) %>% datatable(
          extensions = c('FixedColumns', 'Buttons'), options = list(
            pageLength = 15,
            dom = 'Bfrtip',
            scrollX = TRUE,
            buttons = c('copy', 'csv', 'excel', 'pdf'),
            fixedColumns = list(leftColumns = 1)))
```
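
Because cluster 1 is the baseline in the model above (`phi[i,1] <- 1`), each coefficient in the table acts on the log-odds of a cluster relative to cluster 1. The sketch below shows how a set of linear predictors is turned into cluster probabilities. The function name `linpred_to_prob` and the values of `eta` are hypothetical and are not estimates from our model.

```r
# Convert linear predictors for clusters 2..K into probabilities for clusters 1..K
linpred_to_prob <- function(eta) {
  phi <- c(1, exp(eta))  # phi for the baseline cluster is fixed at 1
  phi / sum(phi)         # normalizing so that the probabilities sum to 1
}

# Made-up linear predictors for clusters 2, 3 and 4 of one county
eta <- c(log(2), log(1), log(4))
p <- linpred_to_prob(eta)
print(p)

# The log-odds of cluster 2 versus cluster 1 recovers the first linear predictor
print(log(p[2] / p[1]))
```

A positive coefficient therefore raises the odds of the corresponding cluster relative to cluster 1, holding the other predictors fixed.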

## Model's Performance

```{r spatialModelResults}
samples_p = spatial.base6[, -(1:60)]
# p already holds probabilities, so the posterior means are computed on p directly
samples_p_mean = colMeans(samples_p[, 1:(number_of_clusters*m)])
# the columns of p are ordered p[1,1], ..., p[m,1], p[1,2], ..., so they fill an m x number_of_clusters matrix by column
pred.0 = matrix(samples_p_mean, nrow = m, ncol = number_of_clusters)
colnames(pred.0) = paste0('C', 1:number_of_clusters)
pred = apply(pred.0, 1, which.max) # most probable cluster for each county

predicted.spatial = data.frame(pred = pred, fips = multiClassDF.sorted$fips, 
                               stringsAsFactors = FALSE) # keeps pred numeric and fips unchanged

predSpatialFinal = merge(predicted.spatial, multiClassDF.sorted[, c('clustReLeveled', 'fips')],
                         by = 'fips')

predSpatialFinal$clustReLeveled %<>%  str_remove('C')
saveRDS(predSpatialFinal, '../Data/predSpatialFinal.rds') # saving the data


# Computing the Confusion Metrics and By Class Metrics
confMatrix = confusionMatrix(as.factor(predSpatialFinal$pred), 
                             as.factor(predSpatialFinal$clustReLeveled))
saveRDS(confMatrix, '../Data/confMatrixSpatialModel.rds') # saving the data

# Printing the Resulting tables nicely
pander(confMatrix$table)
pander(confMatrix$byClass)
pander(confMatrix$overall)
```
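
The steps from posterior mean probabilities to a confusion table can be traced on a toy example. The sketch below assumes three counties and two clusters with made-up probabilities, and it uses base R `table()` in place of `confusionMatrix()`.

```r
m <- 3  # number of hypothetical counties
K <- 2  # number of hypothetical clusters

# Flat vector of mean probabilities, stored cluster by cluster
samples_p_mean <- c(0.9, 0.2, 0.4,   # p[1,1], p[2,1], p[3,1]
                    0.1, 0.8, 0.6)   # p[1,2], p[2,2], p[3,2]

# Reshape into one row per county and one column per cluster
probMatrix <- matrix(samples_p_mean, nrow = m, ncol = K)

# Predicted cluster: the column with the largest probability in each row
pred <- apply(probMatrix, 1, which.max)

# Made-up observed clusters and the resulting confusion table
actual <- c(1, 2, 1)
confusion <- table(Predicted = factor(pred, levels = 1:K),
                   Actual = factor(actual, levels = 1:K))
accuracy <- sum(diag(confusion)) / sum(confusion)

print(confusion)
print(accuracy)
```

Fixing the factor levels to `1:K` keeps the table square even when a cluster is never predicted.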




## Visualizing the Model's Outcomes
```{r vizSpatialModel}
### Visualizing the Model's Predictions
predSpatialFinal = readRDS('../Data/predSpatialFinal.rds') # loading the data
predSpatialFinal$match = ifelse(predSpatialFinal$pred == predSpatialFinal$clustReLeveled, "Yes", "No") %>%
  as.factor()

mapResults = predSpatialFinal
mapResults$fips %<>% as.factor()

# Retrieving the U.S. county composite map as a simplefeature (since it has been overwritten)
# stored as cty_sf so that the albersusa function counties_sf() is not masked by a data object
cty_sf = counties_sf("aeqd") %>% filter(!state %in% c('Alaska', 'Hawaii')) # from albersusa
cty_sf %<>% geo_join(mapResults, by_sp= 'fips', by_df= 'fips')


# Creating a static visual for use in the paper
tiff(filename = '../Figures/clusterMatchMapSpatial.tiff', width = 1366, height =768, pointsize = 16)
tm_shape(cty_sf) + tm_polygons('match', title = 'Cluster Match', style = 'cont', palette = "div") +
  tm_layout(aes.palette = list(div = list("Yes" = "#CAB2D6", "No" = "#6A3D9A"))) +
  tm_credits(paste0('Based on Data from March 01, 2020 - ', endDatePrintV), position=c("right", "bottom"))
invisible( dev.off() ) # to suppress the unwanted output from dev.off


# Creating a longlat projection (required by leaflet)
leaflet_sf = counties_sf("longlat") %>% filter(!state %in% c('Alaska', 'Hawaii')) # from albersusa
leaflet_sf %<>% geo_join(mapResults, by_sp= 'fips', by_df= 'fips')

leafletPal = colorFactor(palette = c("#CAB2D6", "#6A3D9A"), levels = c('Yes', 'No'), na.color = "white")

leaflet(height = 500) %>% 
  setView(lng = -96, lat = 37.8, zoom = 3.8) %>%
  addTiles() %>%
  addPolygons(data = leaflet_sf, stroke = FALSE, fillColor = ~leafletPal(leaflet_sf$match),
              weight = 2, opacity = 1, color = "white", dashArray = "3", fillOpacity = 0.7,
              popup = paste("County:", leaflet_sf$name, '<br>', 
                            "Cluster #:", leaflet_sf$clustReLeveled, '<br>',
                            "Cluster Predicted:", leaflet_sf$pred, '<br>',
                            "Cluster Match:", leaflet_sf$match, '<br>')) %>% #pop-up Menu
  addLegend(position = "bottomleft", pal = leafletPal, values =  leaflet_sf$match, 
            title = "Cluster Match", opacity = 1)
```



---

# References
<div id="refs"></div>

---

# Appendices

## Appendix A: Packages Used
In this appendix, we list all the R packages used in our analysis, along with their versions, to assist with reproducing our results.

```{r sessionInfo}
pander(sessionInfo(), compact = TRUE) # printing the session information
```

## Appendix B: Baseline Multinomial Model

### Model Building

```{r multiModel}
df = readRDS('../Data/df.rds') # loading the data
df %<>% select(-fips) # removed since it was only used in the spatial model 
finalModel =  quiet(multinom(clustReLeveled ~ ., data = df)) # building the multinomial model
```
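
For readers unfamiliar with `multinom()`, the sketch below fits a small multinomial model on the built-in `iris` data, which stands in for our `df` purely for illustration. Here `trace = FALSE` is used to suppress the iteration log; we do not claim that this is how the `quiet()` wrapper above works.

```r
library(nnet)  # provides multinom()

# Multinomial logit with the first factor level (setosa) as the baseline
fitToy <- multinom(Species ~ Sepal.Length + Sepal.Width,
                   data = iris, trace = FALSE)

# One row of coefficients per non-baseline class
coefToy <- coef(fitToy)
print(round(coefToy, 2))

# Predicted class for each observation
predToy <- predict(fitToy, newdata = iris)
print(table(Predicted = predToy, Actual = iris$Species))
```

The coefficient matrix has one row fewer than the number of classes, mirroring the cluster-specific columns reported for the spatial model.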

### Resulting Model
```{r multiModelStargazerTex, echo=FALSE, results='hide'}
# Saving the results as a latex table, but not printing it out in the Markdown document
invisible(stargazer(finalModel, type = 'latex', p.auto = FALSE, out="../Data/multi.tex", 
                    single.row = TRUE, header = FALSE))
```

```{r multiModelStargazer, results='asis'}
# tabulating the model results as an HTML table, which we print below
stargazer(finalModel, type = 'html', p.auto = FALSE, out="../Data/multi.html", single.row = FALSE)
```

### Model's Performance

```{r multiModelPerf}
# examining how well the model performed on our entire dataset
# Recall that we are fitting an explanatory model, and not a predictive model
predictedClass = predict(finalModel, df)
saveRDS(finalModel, '../Data/finalModel.rds') # saving the model

# Computing the Confusion Metrics and By Class Metrics
confMatrix = confusionMatrix(predictedClass, df$clustReLeveled)
saveRDS(confMatrix, '../Data/confMatrix.rds') # saving the data

# Printing the resulting tables nicely
pander(confMatrix$table)
pander(confMatrix$byClass)
pander(confMatrix$overall)
```

### Visualizing the Model’s Predictions

```{r vizMultiModel}
predictedProbs = fitted(finalModel) # computing predicted probabilities for each of the cluster outcome levels
mapResults = cbind(multiClassDF, predictedProbs) # col binding predProbs for Each Cluster with multiClassDF

# Finding indices to subset the data
numberOfClusters = unique(mapResults$cluster_group) %>% as.character() %>% length() 
startCol = ncol(mapResults) - numberOfClusters + 1
endCol = ncol(mapResults)

# Finding whether the predicted and actual clusters matched for each county
mapResults$LargestProbCluster = colnames(mapResults[, startCol:endCol])[apply(mapResults[, startCol:endCol], 1, which.max)] 
mapResults$match = ifelse(mapResults$cluster_group == mapResults$LargestProbCluster, 'Yes', 'No') %>% as.factor()

# Retrieving the U.S. county composite map as a simplefeature (since it has been overwritten)
cty_sf = counties_sf("aeqd") %>% filter(!state %in% c('Alaska', 'Hawaii')) # from albersusa
cty_sf %<>% geo_join(mapResults, by_sp= 'fips', by_df= 'fips')

# Creating a static visual for use in the paper
tiff(filename = '../Figures/clusterMatchMap.tiff', width = 1366, height =768, pointsize = 16)
tm_shape(cty_sf) + tm_polygons('match', title = 'Cluster Match', style = 'cont', palette = "div") +
  tm_layout(aes.palette = list(div = list("Yes" = "#CAB2D6", "No" = "#6A3D9A"))) +
  tm_credits(paste0('Based on Data from March 01, 2020 - ', endDatePrintV), position=c("right", "bottom"))
invisible( dev.off() ) # to suppress the unwanted output from dev.off


# Creating an interactive visual Using the Leaflet Package
#### Creating a longlat projection (required by leaflet)
leaflet_sf = counties_sf("longlat") %>% filter(!state %in% c('Alaska', 'Hawaii')) # from albersusa
leaflet_sf %<>% geo_join(mapResults, by_sp= 'fips', by_df= 'fips')

#### Setting the Color Scheme
leafletPal =  colorFactor(palette = c("#CAB2D6", "#6A3D9A"), levels = c('Yes', 'No'), na.color = "white")

#### The visual
leaflet(height=500) %>% # initializing the leaflet map
  setView(lng = -96, lat = 37.8, zoom = 3.8) %>% # setting the view on Continental US
  addTiles() %>% # adding the default tiles
  addPolygons(data = leaflet_sf, stroke = FALSE, fillColor = ~leafletPal(leaflet_sf$match), # adding the data
              weight = 2, opacity = 1, color = "white", dashArray = "3", fillOpacity = 0.7, # adding color specs
              popup = paste("County:", leaflet_sf$name, '<br>', 
                            "Cluster #:", leaflet_sf$cluster_group, '<br>',
                            "Cluster Predicted:", leaflet_sf$LargestProbCluster, '<br>',
                            "Cluster Match:", leaflet_sf$match, '<br>')) %>% #pop-up Menu
  addLegend(position = "bottomleft", pal = leafletPal, values =  leaflet_sf$match, 
            title = "Cluster Match", opacity = 1) # legend formatting
```
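
The predicted cluster above is looked up through the column names of the probability matrix, not through the column index. The sketch below, with made-up probabilities, shows why this matters once the outcome has been releveled: the reference level comes first, so the column position no longer equals the cluster number.

```r
# Hypothetical fitted probabilities for three counties; the reference level C2 is listed first
predictedProbs <- rbind(c(0.6, 0.3, 0.1),
                        c(0.2, 0.7, 0.1),
                        c(0.1, 0.2, 0.7))
colnames(predictedProbs) <- c("C2", "C1", "C3")

# Name of the column with the largest probability in each row
LargestProbCluster <- colnames(predictedProbs)[apply(predictedProbs, 1, which.max)]

# Made-up observed clusters and the match indicator used for the maps
actual <- c("C2", "C1", "C1")
matchFlag <- ifelse(actual == LargestProbCluster, "Yes", "No")

print(LargestProbCluster)
print(matchFlag)
```

Using the index from `which.max()` directly would have labeled the first county as cluster 1, although its most probable cluster is C2.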
